#1826 sssd etas 99% CPU and runs out of file descriptors when clearing cache
Closed: Fixed None Opened 8 years ago by jhrozek.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 6): Bug 918394

Description of problem:
When we clear the sss-cache by using sss_cache -U, sss_cache -G, sss_cache -u
<login> the process sssd_nss takes each time some fds more. When the process
reaches its fd_limit, sssd runs at 99% CPU and the system gets unresponsive for
every user-related task.


Version-Release number of selected component (if applicable):
rpm -qa | grep sssd
sssd-tools-1.9.2-82.el6.x86_64
sssd-client-1.9.2-82.el6.x86_64
sssd-1.9.2-82.el6.x86_64


How reproducible:
Everytime we run

sss_cache -U or sss_cache -u <login>

the number of open files increases up to the fd_limit. Then, sssd runs at 99%
CPU and no nss is working anymore...

Steps to Reproduce:
1. service sssd start #start service
2. watch "lsof -p `ps -ef | grep sssd_nss | grep -v grep |  perl -l -a -n
-F"\s+" -e 'print $F[1]'` | wc -l" #watch fds
3. sss_cache -U #clear cache several times and watch the number of fds

Actual results:
Increasing number of fds for the sssd_nss process

Expected results:
Constant number of fds for the sssd_nss process


Additional info:
The leaking fds are all pointing to this files, lsof output:

sssd_nss 2090 root 8176u   REG                8,1   6806312    3424241
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8177u   REG                8,1   5206312    3424243
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8178u   REG                8,1   6806312    3424242
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8179u   REG                8,1   5206312    3424245
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8180u   REG                8,1   6806312    3424247
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8181u   REG                8,1   6806312    3424244
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8182u   REG                8,1   5206312    3424246
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8183u   REG                8,1   5206312    3424248
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8184u   REG                8,1   5206312    3424250
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8185u   REG                8,1   6806312    3424251
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8186u   REG                8,1   5206312    3424252
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8187u   REG                8,1   6806312    3424253
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8188u   REG                8,1   5206312    3424254
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8189u   REG                8,1   6806312   11493377
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8190u   REG                8,1   6806312    3424255
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8191u   REG                8,1   5206312    3424256
/var/lib/sss/mc/group (deleted)


The reason for the CPU usage is the error handling after epoll_wait(), strace
output:

epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40376) = 1

Workaround:
We set the fd_limit in the [nss] section of sssd.conf to a much too high value
and restart sssd with our NMS when it approaches the limit.

[nss]
entry_negative_timeout = 0
debug_level = 0x1310
fd_limit=200000

This is not yet fixed in the packages in this repo

[sssd-1.9-RHEL6.3]
name=SSSD 1.9.x built for latest stable RHEL
baseurl=http://repos.fedorapeople.org/repos/jhrozek/sssd/epel-6/$basearch/
enabled=1
skip_if_unavailable=1
gpgcheck=0

Fields changed

blockedby: =>
blocking: =>
coverity: =>
design: =>
design_review: => 0
feature_milestone: =>
fedora_test_page: =>
owner: somebody => mzidek
selected: =>
testsupdated: => 0

Fields changed

patch: 0 => 1

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.9.5

resolution: => fixed
status: new => closed

Metadata Update from @jhrozek:
- Issue assigned to mzidek
- Issue set to the milestone: SSSD 1.9.5

4 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/2868

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Login to comment on this ticket.

Metadata