#1826 sssd eats 99% CPU and runs out of file descriptors when clearing cache
Closed: Fixed. Opened 6 years ago by jhrozek.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 6): Bug 918394

Description of problem:
When we clear the SSSD cache with sss_cache -U, sss_cache -G, or sss_cache -u
<login>, the sssd_nss process gains a few more file descriptors each time. When
the process reaches its fd limit, sssd runs at 99% CPU and the system becomes
unresponsive for every user-related task.


Version-Release number of selected component (if applicable):
rpm -qa | grep sssd
sssd-tools-1.9.2-82.el6.x86_64
sssd-client-1.9.2-82.el6.x86_64
sssd-1.9.2-82.el6.x86_64


How reproducible:
Every time we run

sss_cache -U or sss_cache -u <login>

the number of open files increases until it reaches the fd_limit. Then sssd runs
at 99% CPU and NSS lookups no longer work.

Steps to Reproduce:
1. service sssd start #start service
2. watch "lsof -p `ps -ef | grep sssd_nss | grep -v grep | perl -l -a -n -F"\s+" -e 'print $F[1]'` | wc -l" #watch fds
3. sss_cache -U #clear cache several times and watch the number of fds

Actual results:
Increasing number of fds for the sssd_nss process

Expected results:
Constant number of fds for the sssd_nss process


Additional info:
The leaked fds all point to these files; lsof output:

sssd_nss 2090 root 8176u   REG                8,1   6806312    3424241
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8177u   REG                8,1   5206312    3424243
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8178u   REG                8,1   6806312    3424242
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8179u   REG                8,1   5206312    3424245
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8180u   REG                8,1   6806312    3424247
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8181u   REG                8,1   6806312    3424244
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8182u   REG                8,1   5206312    3424246
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8183u   REG                8,1   5206312    3424248
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8184u   REG                8,1   5206312    3424250
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8185u   REG                8,1   6806312    3424251
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8186u   REG                8,1   5206312    3424252
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8187u   REG                8,1   6806312    3424253
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8188u   REG                8,1   5206312    3424254
/var/lib/sss/mc/group (deleted)
sssd_nss 2090 root 8189u   REG                8,1   6806312   11493377
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8190u   REG                8,1   6806312    3424255
/var/lib/sss/mc/passwd (deleted)
sssd_nss 2090 root 8191u   REG                8,1   5206312    3424256
/var/lib/sss/mc/group (deleted)
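
All of these descriptors refer to unlinked copies of the fast memory-cache files, so it looks like the NSS responder re-creates /var/lib/sss/mc/passwd and /var/lib/sss/mc/group on every invalidation without first releasing the previous mapping and descriptor. As a rough standalone illustration of the close-before-reopen pattern involved (the struct and function names below are invented for the example; this is not the SSSD code or the actual patch):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct mc_ctx {                 /* hypothetical stand-in for a cache context */
    const char *path;           /* e.g. "/var/lib/sss/mc/passwd"; a temp file here */
    int fd;
    void *map;
    size_t size;
};

/* Re-create the cache file.  Releasing the old mapping and descriptor first is
 * the crucial step; skipping the close() leaves one fd per invalidation
 * pointing at ".../mc/passwd (deleted)", exactly as in the lsof output above. */
static int mc_reinit(struct mc_ctx *mc)
{
    if (mc->map != NULL && mc->map != MAP_FAILED) {
        munmap(mc->map, mc->size);
        mc->map = NULL;
    }
    if (mc->fd >= 0) {
        close(mc->fd);          /* without this, the old (now deleted) file leaks */
        mc->fd = -1;
    }

    unlink(mc->path);           /* mimic sss_cache replacing the file */
    mc->fd = open(mc->path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (mc->fd < 0 || ftruncate(mc->fd, (off_t)mc->size) < 0) {
        return -1;
    }
    mc->map = mmap(NULL, mc->size, PROT_READ | PROT_WRITE, MAP_SHARED, mc->fd, 0);
    return (mc->map == MAP_FAILED) ? -1 : 0;
}

int main(void)
{
    struct mc_ctx mc = { .path = "/tmp/mc_demo", .fd = -1, .map = NULL, .size = 4096 };

    /* Invalidate the cache many times; the process fd count stays constant. */
    for (int i = 0; i < 1000; i++) {
        if (mc_reinit(&mc) != 0) {
            perror("mc_reinit");
            return 1;
        }
    }
    return 0;
}

Running the demo while watching ls /proc/<pid>/fd shows the fd count staying flat; removing the close() call reproduces an ever-growing list of "(deleted)" entries like the one above.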


The reason for the CPU usage is the error handling after epoll_wait(); strace
output:

epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40403) = 1
accept(23, 0x149b38e0, [110])           = -1 EMFILE (Too many open files)
epoll_wait(5, {{EPOLLIN, {u32=24633616, u64=24633616}}}, 1, 40376) = 1
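
This is the usual level-triggered busy loop: accept() fails with EMFILE without dequeuing the pending connection, so the listening socket stays readable and the next epoll_wait() returns immediately. A minimal standalone sketch of the pattern (my own illustration, not SSSD's responder code; the loopback port is arbitrary):

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(12345);                   /* arbitrary demo port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
    listen(lsock, 128);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lsock };
    epoll_ctl(ep, EPOLL_CTL_ADD, lsock, &ev);       /* level-triggered by default */

    for (;;) {
        struct epoll_event out;
        if (epoll_wait(ep, &out, 1, -1) <= 0) {
            continue;
        }

        int client = accept(lsock, NULL, NULL);
        if (client < 0) {
            if (errno == EMFILE || errno == ENFILE) {
                /* The pending connection was NOT dequeued, the listener is
                 * still readable, and the next epoll_wait() fires again at
                 * once: the loop spins at 99% CPU, matching the strace
                 * above. */
                fprintf(stderr, "accept: %s\n", strerror(errno));
            }
            continue;
        }
        close(client);                              /* demo only */
    }
}

Besides fixing the underlying fd leak, a common mitigation for this pattern is to keep one spare descriptor in reserve (close it, accept(), close the client, reopen the spare) or to stop watching the listener for EPOLLIN until descriptors become available again.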

Workaround:
We set fd_limit in the [nss] section of sssd.conf to a deliberately oversized
value, and our NMS restarts sssd when the descriptor count approaches the limit.

[nss]
entry_negative_timeout = 0
debug_level = 0x1310
fd_limit=200000
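
(Note: fd_limit only raises the ceiling on descriptors the responder may use, so with the leak still present this postpones the busy loop rather than preventing it.)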

This is not yet fixed in the packages from this repo:

[sssd-1.9-RHEL6.3]
name=SSSD 1.9.x built for latest stable RHEL
baseurl=http://repos.fedorapeople.org/repos/jhrozek/sssd/epel-6/$basearch/
enabled=1
skip_if_unavailable=1
gpgcheck=0

Fields changed

owner: somebody => mzidek

Fields changed

patch: 0 => 1

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.9.5

resolution: => fixed
status: new => closed

Metadata Update from @jhrozek:
- Issue assigned to mzidek
- Issue set to the milestone: SSSD 1.9.5

