#2909 extreme memory usage in libnfsidmap sss.so plug-in when resolving groups with many members
Closed: Fixed. Opened 8 years ago by jhrozek.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 7): Bug 1292238

Description of problem:

We are testing a RHEL7-based NFSv4 server and NFSv4 client in our
infrastructure, in order to provide NFS-based /home directories on the client
that are auto-mounted from the server on-demand. Both the server and the client
share these common features:

* The server and client are VMware virtual machines, with 2GiB of memory each.

* They are joined to our Microsoft Active Directory domain (via "net ads
join").

* They use the AD KDCs as their Kerberos KDCs (in /etc/krb5.conf).

* They run sssd, and sssd provides the (nss, pac, pam) services.

* All users and groups are provided using the sss nsswitch plug-in.

* We use the sss.so plug-in as the translation method in our idmapd.conf (see the sketch below).
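
For reference, the relevant parts of /etc/idmapd.conf look roughly like this. This is a sketch, not our literal file; ad.example.com stands in for our actual AD domain:

  [General]
  # NFSv4 domain used when forming user@domain / group@domain strings
  Domain = ad.example.com

  [Translation]
  # name/id translation via the libnfsidmap sss.so plug-in
  Method = sss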

The client mounts the server with the nfsvers=4.2 and sec=krb5p options.
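
For example, the mount on the client side looks roughly like this (the hostname and paths are placeholders, not our real ones):

  # mount -t nfs4 -o nfsvers=4.2,sec=krb5p nfs-server.example.com:/home /home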

We discovered a few bugs during the initial setup (see bug 1283341) that
suggest we are "pushing the envelope" in terms of our direct integration with
Active Directory.

Specifically, I suspect not many sites are using the libnfsidmap sss.so plug-in
(at least, not yet).

On our first day of testing with our development team, the NFS server invoked
the OOM killer four times, and the NFS client invoked it once.

For the server, each time the OOM killer was invoked, it was triggered by the
rpc.idmapd process. During normal operation, rpc.idmapd uses about 32MiB of
virtual memory and about 1MiB of resident memory:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
27766 root      20   0   33708   1036    812 S  0.0  0.1   0:00.00 rpc.idmapd

But when the OOM killer is invoked, the memory usage of rpc.idmapd has
ballooned. Note that the total_vm and rss columns in the OOM report are counts
of 4KiB pages, so 918202 pages of total_vm is roughly 3.5GiB of virtual address
space, and 423856 pages of rss is roughly 1.6GiB resident. Here's the line
from the OOM killer report:

[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 1279]     0  1279   918202   423856    1794   485969             0 rpc.idmapd

Because of the extreme memory usage of rpc.idmapd, the OOM killer selects it
for termination, and memory usage recovers.

On the client, we've only seen the OOM killer invoked once so far, but the
process that triggered it was nfsidmap. Here's the line from the OOM killer
report:

[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[16323]     0 16323   916107   424365    1792   485464             0 nfsidmap

As with rpc.idmapd on the server, because of the extreme memory usage of
nfsidmap, the OOM killer selects it for termination, and memory usage recovers.

The common element between rpc.idmapd and nfsidmap is that they both load the
sss.so plug-in. We therefore strongly suspect that this extreme memory usage
occurs in the sss.so plug-in.
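
We have not inspected the plug-in's code, but one plausible failure mode for
"groups with many members" is the standard ERANGE retry loop around
getgrnam_r(): the caller must grow its buffer until the group's member list
fits, and a bug in the size calculation can turn that loop into unbounded
allocation, which would show exactly the growth pattern we see. A minimal
sketch of the safe pattern follows (generic C, not the actual sss.so code):

  #include <errno.h>
  #include <grp.h>
  #include <stdlib.h>

  /* Resolve a group name to a gid, growing the lookup buffer on ERANGE.
   * The hard cap keeps a size-calculation bug (or a pathological group)
   * from consuming all memory. */
  static int name_to_gid(const char *name, gid_t *gid)
  {
      size_t buflen = 4096;               /* initial guess */
      const size_t buflen_max = 1 << 24;  /* hard cap: 16MiB */
      char *buf = NULL;

      for (;;) {
          struct group grp, *result = NULL;

          char *newbuf = realloc(buf, buflen);
          if (newbuf == NULL) {
              free(buf);
              return -1;                  /* out of memory */
          }
          buf = newbuf;

          int ret = getgrnam_r(name, &grp, buf, buflen, &result);
          if (ret == ERANGE && buflen < buflen_max) {
              buflen *= 2;                /* member list did not fit; retry */
              continue;
          }
          if (ret != 0 || result == NULL) {
              free(buf);
              return -1;                  /* lookup error or no such group */
          }

          *gid = grp.gr_gid;
          free(buf);
          return 0;
      }
  }

Without the buflen_max cap (or with a broken retry condition), a single
lookup of a very large group could allocate until the OOM killer fires.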

Version-Release number of selected component (if applicable):

0:gssproxy-0.4.1-7.el7.x86_64
0:libnfsidmap-0.25-12.el7.x86_64
0:libsss_idmap-1.13.0-40.el7.x86_64
0:libsss_nss_idmap-1.13.0-40.el7.x86_64
0:python-sssdconfig-1.13.0-40.el7.noarch
0:sssd-1.13.0-40.el7.x86_64
0:sssd-ad-1.13.0-40.el7.x86_64
0:sssd-client-1.13.0-40.el7.x86_64
0:sssd-common-1.13.0-40.el7.x86_64
0:sssd-common-pac-1.13.0-40.el7.x86_64
0:sssd-ipa-1.13.0-40.el7.x86_64
0:sssd-krb5-1.13.0-40.el7.x86_64
0:sssd-krb5-common-1.13.0-40.el7.x86_64
0:sssd-ldap-1.13.0-40.el7.x86_64
0:sssd-proxy-1.13.0-40.el7.x86_64

The client is using kernel 3.10.0-327.el7.x86_64; the server is using
3.10.0-327.el7.local.2.x86_64. (The only change in the .local.2 kernel is the
kernel gss patch from bug 1283341.)

How reproducible:

I do not know the specific circumstances that trigger the extreme memory usage
in rpc.idmapd / nfsidmap, but we seem to be able to trigger it fairly easily on
the server.

Additional info:

I don't know whether the extreme memory usage of rpc.idmapd/nfsidmap would
ever recover on its own. That is: if our VMs had 8GiB of memory instead of
2GiB, so that rpc.idmapd/nfsidmap could consume more memory before triggering
the OOM killer, would they recover on their own after allocating, say, 2GiB of
memory?

(I suspect the answer is "no". Given that normal operation consumes ~1MiB of
resident memory, I suspect that whatever is triggering this consumption is
essentially an infinite loop, and that the process will keep allocating until
memory is exhausted. And even if the processes would recover at some memory
usage point, suddenly jumping from ~1MiB resident to more than a gigabyte
resident is unacceptable behavior.)

We do have a temporary work-around for this behavior: since the server and the
client are configured identically, and obtain the same passwd/group entries
(via sssd), we can use the nsswitch.so plug-in on both the client and the
server and still have name/id translation work (see the example below). (We
are testing that now, and at least so far, we haven't seen the OOM killer
invoked on either the server or the client.)
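
Concretely, the work-around is a one-line change in /etc/idmapd.conf on both
hosts (again a sketch, not our literal file):

  [Translation]
  # resolve names via the glibc NSS interfaces, which sssd serves anyway
  Method = nsswitch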

BUT: we have Linux NFSv4 clients that will need to use NFSv4/krb5/AD servers
that are not Linux-based. Those clients *must* use the sss plug-in for idmapd
in order for NFSv4 name/id translation to work. So if the sss plug-in for
idmapd is what is causing the extreme memory usage in rpc.idmapd and nfsidmap
(and the data so far strongly suggest that is the case), we absolutely need a
way to prevent that from happening.

Sumit has a patch; assigning to him.

owner: somebody => sbose

Fields changed

patch: 0 => 1
status: new => assigned

Since the ticket is fixed and there is a downstream clone, I think it's safe to mark as closed.

Even though there is also a sssd-1-12 commit, I'm going to move the ticket into 1.13.x as it's unclear when/if we'll do another 1.12 release.

milestone: NEEDS_TRIAGE => SSSD 1.13.4
resolution: => fixed
status: assigned => closed

Metadata Update from @jhrozek (7 years ago):
- Issue assigned to sbose
- Issue set to the milestone: SSSD 1.13.4

SSSD is moving from Pagure to GitHub. This means that new issues and pull
requests will be accepted only in SSSD's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/3950

If you want to receive further updates on the issue, please navigate to the
GitHub issue and click the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.
