Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 7): Bug 1292238
Description of problem:

We are testing a RHEL7-based NFSv4 server and NFSv4 client in our infrastructure, in order to provide NFS-based /home directories on the client that are auto-mounted from the server on demand. Both the server and the client share these common features:

* The server and client are VMware virtual machines, with 2GiB of memory each.
* They are joined to our Microsoft Active Directory domain (via "net ads join").
* They use the AD KDC as their Kerberos KDCs (in /etc/krb5.conf).
* They run sssd, and sssd provides the (nss, pac, pam) services.
* All users and groups are provided using the sss nsswitch plug-in.
* We use the sss.so plug-in in our idmapd.conf.

The client mounts the server with the nfsvers=4.2 and sec=krb5p options. (An illustrative sketch of this configuration follows the description below.)

We discovered a few bugs during the initial setup (see bug 1283341) that suggest we are "pushing the envelope" in terms of our direct integration with Active Directory. Specifically, I suspect not many sites are using the libnfsidmap sss.so plug-in (at least, not yet).

On our first day of testing with our development team, the NFS server has invoked the OOM killer 4 times, and the NFS client has invoked the OOM killer once.

For the server, each time the OOM killer was invoked, it was triggered by the rpc.idmapd process. During normal operation, rpc.idmapd has about 32MiB of total memory usage, and about 1MiB of resident memory:

      PID USER  PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
    27766 root  20  0 33708 1036 812 S  0.0  0.1 0:00.00 rpc.idmapd

But when the OOM killer is invoked, the memory usage of rpc.idmapd has jumped to almost 1GiB of total memory, with almost 512MiB resident. Here's the line from the OOM killer report:

    [ pid ]   uid  tgid total_vm    rss nr_ptes swapents oom_score_adj name
    [ 1279]     0  1279   918202 423856    1794   485969             0 rpc.idmapd

Because of the extreme memory usage of rpc.idmapd, the OOM killer selects it for termination, and memory usage recovers.

On the client, we've only seen the OOM killer invoked once so far, but the process that triggered it was nfsidmap. Here's the line from the OOM killer report:

    [ pid ]   uid  tgid total_vm    rss nr_ptes swapents oom_score_adj name
    [16323]     0 16323   916107 424365    1792   485464             0 nfsidmap

As with rpc.idmapd on the server, because of the extreme memory usage of nfsidmap, the OOM killer selects it for termination, and memory usage recovers.

The common element between rpc.idmapd and nfsidmap is that both load the sss.so plug-in. Therefore, we strongly suspect that this extreme memory usage occurs in the sss.so plug-in.

Version-Release number of selected component (if applicable):

    0:gssproxy-0.4.1-7.el7.x86_64
    0:libnfsidmap-0.25-12.el7.x86_64
    0:libsss_idmap-1.13.0-40.el7.x86_64
    0:libsss_nss_idmap-1.13.0-40.el7.x86_64
    0:python-sssdconfig-1.13.0-40.el7.noarch
    0:sssd-1.13.0-40.el7.x86_64
    0:sssd-ad-1.13.0-40.el7.x86_64
    0:sssd-client-1.13.0-40.el7.x86_64
    0:sssd-common-1.13.0-40.el7.x86_64
    0:sssd-common-pac-1.13.0-40.el7.x86_64
    0:sssd-ipa-1.13.0-40.el7.x86_64
    0:sssd-krb5-1.13.0-40.el7.x86_64
    0:sssd-krb5-common-1.13.0-40.el7.x86_64
    0:sssd-ldap-1.13.0-40.el7.x86_64
    0:sssd-proxy-1.13.0-40.el7.x86_64

The client is using kernel 3.10.0-327.el7.x86_64; the server is using 3.10.0-327.el7.local.2.x86_64. (The only change the .local.2 kernel adds is the kernel gss patch from bug 1283341.)

How reproducible:

I do not know the specific circumstances that trigger the extreme memory usage in rpc.idmapd / nfsidmap, but we seem to be able to trigger it fairly easily on the server.
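For reference, here is a minimal sketch of the configuration described above, under the assumption of a generic AD-joined setup. The domain name (example.com), export path (/export/home), and mount point (/home) are placeholders, not values taken from our environment; the Translation Method, nsswitch sources, and mount options are the ones mentioned in the description.

    # /etc/idmapd.conf (client and server) -- NFSv4 name/ID translation via SSSD
    [General]
    Domain = example.com

    [Translation]
    Method = sss

    # /etc/nsswitch.conf -- users and groups come from SSSD
    passwd: files sss
    group:  files sss

    # Client-side mount of the Kerberized NFSv4.2 export (hypothetical paths)
    mount -t nfs -o nfsvers=4.2,sec=krb5p nfs-server.example.com:/export/home /home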
Additional info:

I don't know whether the extreme memory usage of rpc.idmapd/nfsidmap would recover on its own. Meaning, if our VMs had 8GiB of memory instead of 2GiB, so that rpc.idmapd/nfsidmap could consume more than ~1GiB of memory before triggering the OOM killer, would they recover after allocating, say, 2GiB of memory? (I suspect the answer is "no". Given that normal operation consumes ~1MiB of resident memory, I suspect that whatever memory consumption is being triggered is essentially an infinite loop, and that the process will keep consuming memory until exhaustion. And even if the processes would recover at some memory usage point, suddenly jumping from ~1MiB resident to ~512MiB resident is unacceptable behavior.)

We do have a temporary work-around for this behavior: since the server and the client are identical, and obtain the same passwd/group entries (via sssd), we can use the nsswitch.so plug-in on both the client and the server and still have name/ID translation work. (We are testing that now, and so far we haven't seen the OOM killer invoked on either the server or the client. A sketch of the work-around configuration follows this comment.)

BUT: we have Linux NFSv4 clients that will need to use NFSv4/krb5/AD servers that are not Linux-based. Those clients *must* use the sss plug-in for idmapd in order for NFSv4 name/ID translation to work. So if the sss plug-in for idmapd is what is causing the extreme memory usage in rpc.idmapd and nfsidmap (and the data so far strongly suggest that it is), we absolutely need a way to prevent that from happening.
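As an illustration of the temporary work-around described above (switching libnfsidmap from the sss plug-in to the nsswitch plug-in on both machines), the change amounts to something like the following. This is a sketch, not the reporter's exact files; the restart/cache-flush commands are what would typically be needed on RHEL7 after editing idmapd.conf.

    # /etc/idmapd.conf -- temporary work-around: resolve names via regular NSS
    # lookups (which sssd still serves) instead of the sss.so plug-in
    [Translation]
    Method = nsswitch

    # Server side: restart the idmap daemon so it rereads idmapd.conf
    systemctl restart nfs-idmapd

    # Client side: nfsidmap is invoked by the kernel on demand; clearing the
    # keyring cache forces fresh name/ID translations
    nfsidmap -c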
Sumit has a patch, assigning to him.
blockedby: =>
blocking: =>
changelog: =>
coverity: =>
design: =>
design_review: => 0
feature_milestone: =>
fedora_test_page: =>
mark: no => 0
owner: somebody => sbose
review: True => 0
selected: =>
testsupdated: => 0
Fields changed
patch: 0 => 1
status: new => assigned
master:
sssd-1-13:
sssd-1-12:
Since the ticket is fixed and there is a downstream clone, I think it's safe to mark as closed.
Even though there is also a sssd-1-12 commit, I'm going to move the ticket into 1.13.x as it's unclear when/if we'll do another 1.12 release.
milestone: NEEDS_TRIAGE => SSSD 1.13.4
resolution: => fixed
status: assigned => closed
Metadata Update from @jhrozek:
- Issue assigned to sbose
- Issue set to the milestone: SSSD 1.13.4
SSSD is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in SSSD's GitHub repository.
This issue has been cloned to GitHub and is available here: https://github.com/SSSD/sssd/issues/3950
If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the subscribe button.
Thank you for understanding. We apologize for any inconvenience.