#1251 SSSD memory usage continuously growing
Closed: Fixed None Opened 7 years ago by mayak.

Dear SSSD team,

we are experiencing an issue with SSSD where sssd_be consumes a lot of memory. The RAM consumption grows continuously in a setup where an SFTP/SSH login happens every 30 seconds.

Over a period of 17 hours, the memory usage of SSSD increased by about 20%, i.e. roughly 200MB of the 1024MB system memory.
After restarting SSSD the memory consumption goes back to normal.
The attached debug log file shows the login sequence of the mentioned SFTP user.
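
For scale, the figures in this report (about 200MB over 17 hours, one login every 30 seconds) can be turned into a rough per-login growth estimate. This is only a back-of-the-envelope sketch: it assumes the growth is linear and caused solely by the periodic logins, neither of which the report establishes.

```python
# Rough per-login growth estimate from the figures in the report.
# Assumption: growth is linear and caused only by the periodic logins.
hours_observed = 17
login_interval_s = 30
growth_mb = 200

logins = hours_observed * 3600 // login_interval_s
growth_per_login_kb = growth_mb * 1024 / logins

print(f"{logins} logins, ~{growth_per_login_kb:.0f} KB per login")
# prints "2040 logins, ~100 KB per login"
```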

Since we don't have a large LDAP directory (~60 Unix users / ~20 Unix groups) I suppose we might have a misconfiguration in our sssd.conf (see attachments).

When I remove the SSSD cache file and run id on 50 LDAP users, the memory consumption grows by only ~5MB. The memory usage stays the same even when I run the id command over and over again (executed at least 20 times).
The commands getent passwd and getent group also do not increase the memory usage of SSSD.
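
Measurements like the ones above can be repeated by sampling the resident set size of sssd_be at intervals. The following is a minimal sketch, assuming a Linux host with Python 3 and reading the VmRSS field of /proc/<pid>/status. The helper names are hypothetical, the sample text is made up for illustration, and the thread does not say which tool the reporter actually used.

```python
import re
from pathlib import Path


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the current resident set size of a running process (Linux only)."""
    return parse_vm_rss_kb(Path(f"/proc/{pid}/status").read_text())


# Made-up sample in the /proc/<pid>/status format, for illustration only.
sample = "Name:\tsssd_be\nVmPeak:\t   30000 kB\nVmRSS:\t   16384 kB\n"
print(parse_vm_rss_kb(sample))  # prints 16384
```

Calling read_rss_kb with the PID of sssd_be once a minute and logging the result would give a growth curve comparable to the attached graph.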

Information about the environment/system:
- LDAP is ID and AUTH provider
- LDAP schema is rfc2307bis
- RHEL 6.2 / CentOS 6.2
- sssd-client-1.5.1-66.el6_2.3.i686
- sssd-1.5.1-66.el6_2.3.i686

Attachments:
- sssd.conf (sanitized)
- sssd_EXAMPLE.log (sanitized)
- sssd_mem_usage.png (graph)

If you need any further debug information please let me know.
Many thanks for looking into this issue.

Kind regards,
mayak


sssd_mem_usage.png (graph)
sssd_mem_usage.png

Do you happen to know which particular process consumes the memory? SSSD spawns several processes - with your configuration that would be sssd, sssd_nss, sssd_pam and sssd_be.

Replying to [comment:1 jhrozek]:

Do you happen to know which particular process consumes the memory? SSSD spawns several processes - with your configuration that would be sssd, sssd_nss, sssd_pam and sssd_be.

The original report reads: "where sssd_be is consuming a lot of memory".

component: SSSD => Data Provider
milestone: NEEDS_TRIAGE => SSSD 1.8.2 (LTM)
priority: major => critical

Fields changed

owner: somebody => jhrozek

Fields changed

status: new => assigned

daily memory growth (cronjob restarts sssd)
memutil.png

In case you were waiting for my confirmation - yes, the process is called sssd_be.
Sorry for the late reply. Thanks for your efforts.

Replying to [comment:5 mayak]:

In case you were waiting for my confirmation - yes, the process is called sssd_be.
Sorry for the late reply. Thanks for your efforts.

Thank you, I'm looking into the issue, but so far I've been unable to reproduce the memory growth in my test environment. I'll run another round of tests today.

Replying to [comment:6 jhrozek]:

Thank you, I'm looking into the issue, but so far I've been unable to reproduce the memory growth in my test environment. I'll run another round of tests today.

I have to admit that I am also unable to reproduce this issue under 'normal' circumstances. It only appears on this one host, where proprietary software establishes an SFTP connection every 30 seconds. When I manually open several SFTP connections with WinSCP/SCP, the memory consumption of sssd_be does not increase.
Please let me know if I should provide more information or debug something for you.
Thank you.

Replying to [comment:7 mayak]:

Replying to [comment:6 jhrozek]:

Thank you, I'm looking into the issue, but so far I've been unable to reproduce the memory growth in my test environment. I'll run another round of tests today.

I have to admit that I am also unable to reproduce this issue under 'normal' circumstances. It only appears on this one host, where proprietary software establishes an SFTP connection every 30 seconds. When I manually open several SFTP connections with WinSCP/SCP, the memory consumption of sssd_be does not increase.
Please let me know if I should provide more information or debug something for you.
Thank you.

Would you be willing to run valgrind for us and send us the results? You can do this by installing valgrind (via yum) and then adding the following line to your sssd.conf in the [domain/EXAMPLE] section (substituting EXAMPLE with your actual SSSD domain name):

command = /usr/bin/valgrind --log-file=/tmp/EXAMPLE-grind.%p.log /usr/libexec/sssd/sssd_be --domain EXAMPLE --debug-to-files

Run 'service sssd restart' and note the PID of the sssd_be process that is running (using ps -ef |grep sssd_be). Run the test for a few minutes, then do 'service sssd stop' and attach the /tmp/EXAMPLE-grind.<pid>.log to this ticket.

Replying to [comment:8 sgallagh]:

Would you be willing to run valgrind for us and send us the results?

Yes, I will try to provide that debug information.
I will need a few minutes/hours. Thanks for your patience.

Please find the generated EXAMPLE-grind.<PID>.log in the attachments. The sssd daemon was running for about 5 minutes with this additional command. The valgrind process was started with the following command line (as shown by ps -ef):

/usr/bin/valgrind --log-file=/tmp/EXAMPLE-grind.%p.log /usr/libexec/sssd/sssd_be --domain EXAMPLE --debug-to-files

I hope this helps. Many thanks.

Replying to [comment:10 mayak]:

Please find the generated EXAMPLE-grind.<PID>.log in the attachments. The sssd daemon was running for about 5 minutes with this additional command. The valgrind process was started with the following command line (as shown by ps -ef):

/usr/bin/valgrind --log-file=/tmp/EXAMPLE-grind.%p.log /usr/libexec/sssd/sssd_be --domain EXAMPLE --debug-to-files

I hope this helps. Many thanks.

Hi,
sorry, but you also need to add --leak-check=full to the list of valgrind options. Without this switch, valgrind only reports memory access issues (such as use-after-free), but not the leaks.

Sorry about the inconvenience. We would still very much appreciate the data.

btw, I've seen the invalid file descriptor message during my testing with 1.5.x, but never with 1.8

Dear jhrozek, please find the requested log file in the attachments. This time I ran valgrind with the --leak-check=full option. The sssd service was running for at least 5 minutes.
Let me know if you need more information. Thanks.

valgrind output (--leak-check=full)
EXAMPLE-grind.13700.log

That's quite helpful, thank you. These are the two biggest leaks:

==13700== 163,300 (800 direct, 162,500 indirect) bytes in 10 blocks are definitely lost in loss record 560 of 564
==13700==    at 0x40053B3: calloc (vg_replace_malloc.c:467)
==13700==    by 0xCBBD00: ber_memcalloc_x (in /lib/liblber-2.4.so.2.5.6)
==13700==    by 0x12249F: ldap_send_server_request (in /lib/libldap-2.4.so.2.5.6)
==13700==    by 0x123323: ldap_chase_v3referrals (in /lib/libldap-2.4.so.2.5.6)
==13700==    by 0x10D4EB: ldap_result (in /lib/libldap-2.4.so.2.5.6)
==13700==    by 0x485E262: ??? (in /usr/lib/sssd/libsss_ldap.so.1.0.0)
==13700==    by 0xB71E8A: ??? (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0xB74124: ??? (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0xB70F17: _tevent_loop_once (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0xB70FAE: ??? (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0xB70C88: _tevent_loop_wait (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0x8080D1C: server_loop (in /usr/libexec/sssd/sssd_be)
==13700==
==13700== 987,424 (4,720 direct, 982,704 indirect) bytes in 59 blocks are definitely lost in loss record 564 of 564
==13700==    at 0x40053B3: calloc (vg_replace_malloc.c:467)
==13700==    by 0xCBBD00: ber_memcalloc_x (in /lib/liblber-2.4.so.2.5.6)
==13700==    by 0x12249F: ldap_send_server_request (in /lib/libldap-2.4.so.2.5.6)
==13700==    by 0x123323: ldap_chase_v3referrals (in /lib/libldap-2.4.so.2.5.6)
==13700==    by 0x10D4EB: ldap_result (in /lib/libldap-2.4.so.2.5.6)
==13700==    by 0x485E262: ??? (in /usr/lib/sssd/libsss_ldap.so.1.0.0)
==13700==    by 0xB74262: ??? (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0xB70F17: _tevent_loop_once (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0xB70FAE: ??? (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0xB70C88: _tevent_loop_wait (in /usr/lib/libtevent.so.0.9.8)
==13700==    by 0x8080D1C: server_loop (in /usr/libexec/sssd/sssd_be)
==13700==    by 0x8055BB8: main (in /usr/libexec/sssd/sssd_be)

May I guess that your LDAP server is Microsoft Active Directory? I see memory leaks during referral chasing from the valgrind log and MSAD utilizes quite a few of those.

If your environment does not use referrals, can you check if setting ldap_referrals = false makes the memory consumption better?

Also, there is a large number of leaks coming from moznss/nspr. I saw those when I tested on RHEL6, but not on Fedora. I will follow up with the openldap maintainers to see if they are aware of any leaks.

There is also a resolver-related memory leak that is already fixed in 6.3. That one shouldn't be a problem, though, because hostname resolution is a relatively rare operation.
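
When a valgrind log contains hundreds of loss records, the largest ones can be pulled out with a short script instead of by hand. The following is a minimal sketch, assuming Python 3 and the "definitely lost" summary-line format with direct and indirect byte counts, as in the two records quoted above. The function name is hypothetical, and summary lines without a direct/indirect breakdown are not matched.

```python
import re

# Matches memcheck summary lines of the form:
# ==PID== N (D direct, I indirect) bytes in B blocks are definitely lost in loss record R of T
LOSS_RE = re.compile(
    r"==\d+==\s+([\d,]+) \(([\d,]+) direct, ([\d,]+) indirect\) bytes in "
    r"([\d,]+) blocks are definitely lost in loss record (\d+) of (\d+)"
)


def definitely_lost(log_text):
    """Return (total_bytes, blocks, record_no) tuples, largest leak first."""
    records = []
    for m in LOSS_RE.finditer(log_text):
        total = int(m.group(1).replace(",", ""))
        blocks = int(m.group(4).replace(",", ""))
        records.append((total, blocks, int(m.group(5))))
    return sorted(records, reverse=True)


# The two summary lines quoted in this ticket.
log = """\
==13700== 163,300 (800 direct, 162,500 indirect) bytes in 10 blocks are definitely lost in loss record 560 of 564
==13700== 987,424 (4,720 direct, 982,704 indirect) bytes in 59 blocks are definitely lost in loss record 564 of 564
"""
for total, blocks, record in definitely_lost(log):
    print(f"record {record}: {total} bytes in {blocks} blocks")
```

For the two records in this ticket, the script lists record 564 first, followed by record 560. Together they account for 1,150,724 bytes, all allocated under ldap_chase_v3referrals.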

Replying to [comment:13 jhrozek]:

May I guess that your LDAP server is Microsoft Active Directory? I see memory leaks during referral chasing from the valgrind log and MSAD utilizes quite a few of those.

That's right. We use MSAD as directory server.

Replying to [comment:13 jhrozek]:

If your environment does not use referrals, can you check if setting ldap_referrals = false makes the memory consumption better?

For testing purposes I have added the option ldap_referrals = false to sssd.conf on this one host. I don't think it is a coincidence that the id command (on about 50 LDAP users) also ran much faster!

Some additional testing:

I cleaned the sssd cache file and ran the id command (50 users) with ldap_referrals = true: the sssd_be process was consuming 1.6% of the system memory (~16 MB).

When I did the same with ldap_referrals = false the process sssd_be was consuming only 0.5% of the system memory (~5 MB).

To check if the option ldap_referrals = false makes the memory consumption better I will keep sssd running this way until next Monday (~2 days). I will keep you updated.
Thank you for all the hints and efforts.

Replying to [comment:14 mayak]:

To check if the option ldap_referrals = false makes the memory consumption better I will keep sssd running this way until next Monday (~2 days).

The option ldap_referrals = false massively improved the memory consumption of sssd_be.
Logins and LDAP queries are also faster. I will keep this new configuration for SSSD.
Please let me know if you need more details. Many thanks.

memory utilization with 'ldap_referrals = false'
ldap_referrals-false.png

Replying to [comment:15 mayak]:

Replying to [comment:14 mayak]:

To check if the option ldap_referrals = false makes the memory consumption better I will keep sssd running this way until next Monday (~2 days).

The option ldap_referrals = false massively improved the memory consumption of sssd_be.
Logins and LDAP queries are also faster. I will keep this new configuration for SSSD.
Please let me know if you need more details. Many thanks.

As both the graph and the valgrind log show, there seems to be a memory leak somewhere in the referral support of the openldap libraries. I will follow up with the openldap maintainer to check if this is a known issue.

Also, Stephen found a memory leak in SSSD's TLS setup, which might have contributed to the growth. It will be fixed in the next SSSD release.

The memory leak in openldap was not known to the openldap maintainer and is now being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=807363

The TLS leak was fixed in:

The openldap referral memory leak is being tracked in the Red Hat Bugzilla now.

patch: 0 => 1

Marking as complete. The openldap issue is out of our control.

resolution: => fixed
status: assigned => closed

Metadata Update from @mayak:
- Issue assigned to jhrozek
- Issue set to the milestone: SSSD 1.8.2 (LTM)
