Issue #478: SSSD is *much* slower than nss_ldap - sssd

Closed: Fixed. Opened 13 years ago by jdieter.

I work at a school where we have roughly 100 Fedora 11 boxes, but almost 1200 users. We use LDAP for authentication and authorization, and it's worked quite well.

The home directories are in /networld/usershare/users, and "time ls -l" on a freshly booted Fedora 11 box using nss_ldap gives me:

real    0m2.418s
user    0m0.097s
sys     0m0.162s

Running the same command on Fedora 13 using SSSD gives me:

real    7m50.658s
user    0m0.080s
sys     0m0.150s

Obviously from 2s -> almost 8m is quite the regression, and way too long to wait the first time the system boots.

Changing both enumerate and cache_credentials to false doesn't have any significant effect on the time it takes to run the command.
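
To put the two measurements side by side, here is a minimal Python sketch that converts the `time` readings above to seconds and derives the slowdown and a rough per-owner cost. The helper `parse_time` and the constant `ASSUMED_USERS` are illustrative names; the figure of 1200 rounds "almost 1200 users" and assumes one home directory per user, which the report does not state exactly.

```python
import re

def parse_time(value: str) -> float:
    """Convert a shell `time` reading such as '7m50.658s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

nss_ldap_real = parse_time("0m2.418s")   # Fedora 11, nss_ldap
sssd_real = parse_time("7m50.658s")      # Fedora 13, SSSD, empty cache

# Assumption: roughly one directory owner per user ("almost 1200 users").
ASSUMED_USERS = 1200

slowdown = sssd_real / nss_ldap_real
per_user = sssd_real / ASSUMED_USERS

print(f"SSSD run took {sssd_real:.3f} s, about {slowdown:.0f}x the nss_ldap run")
print(f"roughly {per_user:.2f} s per looked-up owner")
```

If the assumption holds, the cold-cache listing costs a few tenths of a second per owner, which points at per-entry lookup latency rather than at the `ls` command itself.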

sgallagh commented 13 years ago

I do have to note, you have an invalid configuration in that sssd.conf. You should not use {{{ldap_id_use_start_tls = True}}} simultaneously with an {{{ldaps://}}} URI. They are in conflict, and may be causing your performance issues.

Also, you make mention of "the first time the system boots". Can you provide the numbers from a second run of {{{time ls -l}}} on SSSD? The cached results should return much more quickly.

Also, by default we set an {{{entry_cache_timeout}}} of 5400 seconds (90 minutes). So if you're waiting 90 minutes between running these tests, the cache would have aged out. Cache entries persist across reboots.

You may also want to look into the {{{entry_cache_nowait_percentage}}} option. This allows you to enable a rolling update of users that are being utilized more often, so that their cache entries are updated in the background. This avoids the lag out to LDAP to refresh the cache.

sgallagh commented 13 years ago

Oh, and for the record, {{{enumerate=true}}} is not advised, as it is notably slower than direct entry lookups.

jdieter commented 13 years ago

new sssd.conf
sssd.conf

jdieter commented 13 years ago

Ok, I've removed the line that says {{{ldap_id_use_start_tls = True}}} and changed enumerate to false.

The timings:

real    7m39.326s
user    0m0.055s
sys     0m0.174s

Second time is cached and much much better:

real    0m1.869s
user    0m0.049s
sys     0m0.113s

After a reboot, the time is:

real    0m1.610s
user    0m0.049s
sys     0m0.118s

Will it go back to an eight minute delay if I wait 90 minutes?

sgallagh commented 13 years ago

Would you mind running one more test for me?

First, stop SSSD and delete {{{/var/lib/sss/db/cache_default.ldb}}}. (This will purge your cache.)

Then restart the SSSD and run {{{time ls -ld /networld/usershare/users/<valid_username>}}}

This will test the time it's taking us to run an individual lookup. If this is greater than half a second, then there's probably some latency between the SSSD and the LDAP server.

Also, I should mention that your results with nss_ldap may not have been strictly accurate, as you were probably running with nscd enabled, which tends to have more aggressive caching.

To answer your question about the eight-minute delay: yes, if you wait 90 minutes, all of the entries will time out. One thing you can add to the {{{[nss]}}} section of sssd.conf is {{{entry_cache_nowait_percentage=50}}}. This means that if a user is looked up after 50% of the cache timeout has expired, the SSSD will update the cache behind the scenes (continuing to return the user from cache in the meantime). This results in a rolling extension of the cache timeouts for those users that are actually logged in.
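
As a rough illustration of the behaviour described above, the following Python sketch classifies a lookup by the age of its cache entry. The function name `cache_action` and the returned strings are made up for this example and are not SSSD internals. The 5400-second timeout and the 50% setting come from the comments in this thread; treating a percentage of 0 as "feature off" is an assumption.

```python
def cache_action(age_seconds: float,
                 entry_cache_timeout: int = 5400,
                 nowait_percentage: int = 50) -> str:
    """Classify a lookup by the age of its cache entry (illustrative only)."""
    if age_seconds >= entry_cache_timeout:
        # The entry has aged out, so the caller waits for LDAP.
        return "expired: fetch from LDAP before answering"
    threshold = entry_cache_timeout * nowait_percentage / 100
    # Assumption: a percentage of 0 switches the background refresh off.
    if nowait_percentage > 0 and age_seconds >= threshold:
        return "valid: answer from cache, refresh in background"
    return "valid: answer from cache"

for minutes in (10, 45, 60, 90):
    print(f"{minutes:>3} min -> {cache_action(minutes * 60)}")
```

With the defaults, an entry looked up between 45 and 90 minutes after it was cached is answered immediately and refreshed in the background, so a user who keeps being looked up never reaches the blocking case.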

jhrozek commented 13 years ago

Fields changed

cc: => jhrozek

jdieter commented 13 years ago

Ok, removed /var/lib/sss/db/* and rebooted. {{{ls -l /networld/usershare/users/jd001}}}:

real    0m0.196s
user    0m0.004s
sys     0m0.011s

FWIW, I disabled nscd on one of the Fedora 11 boxes and ran {{{ls -l /networld/usershare/users}}}:

real    0m5.946s
user    0m0.272s
sys     0m0.295s

The problem with {{{entry_cache_nowait_percentage}}} is that, if I understand it correctly, it's most effective if you consistently have the same user(s) logged in. In our computer labs, we may have fifteen different students logged in to each computer over the course of the day.

Also, FWIW, {{{time ldapsearch -x -b ou=Users,dc=lesbg,dc=loc}}} gives me:

real    0m1.378s
user    0m0.145s
sys     0m0.136s

sgallagh commented 13 years ago

Yes, the nowait feature is meant to make sure that active users don't get hit by cache misses.

It definitely looks like there's a performance hit when we're asked to look up a large number of users all at the same time. Is this a common use-case, that people would be looking at directories of files with many different ownerships?

I'm very surprised that you were seeing this with {{{enumerate=true}}}, since that should pre-cache all of your users for you. It's possible you were just running your test before the full enumeration run completed.

I suspect that if you purge your cache, set {{{enumerate=true}}}, restart the SSSD and wait ten minutes, your cache will have been primed. It still shouldn't take that long, but given that your ldapsearch command above took almost 1.4s, that's a pretty slow LDAP server connection.

Also, you can add "debug_level = 5" or higher to your [domain/default] section and check whether something is causing the enumeration to return a failure.

jdieter commented 13 years ago

Replying to [comment:7 sgallagh]:

> It definitely looks like there's a performance hit when we're asked to look up a large number of users all at the same time. Is this a common use-case, that people would be looking at directories of files with many different ownerships?

It's not very common for the students, but I do list {{{/networld/usershare/users}}} quite often. Granted, this is an edge case, but it is one I hit.

> I'm very surprised that you were seeing this with {{{enumerate=true}}}, since that should pre-cache all of your users for you. It's possible you were just running your test before the full enumeration run completed.
>
> I suspect that if you purge your cache, set {{{enumerate=true}}}, restart the SSSD and wait ten minutes, your cache will have been primed. It still shouldn't take that long, but given that your ldapsearch command above took almost 1.4s, that's a pretty slow LDAP server connection.

That's definitely true. I cleared the cache, restarted the computer and then did my timed ls immediately after the computer rebooted. So here's the next question. If I set {{{enumerate=true}}}, what happens when the cache expires? If I try to do an {{{ls -l /networld/usershare/users}}} while it's refreshing the cache in the background, will it use the expired cached data or will it block until the cache is updated?

As for the LDAP server being slow, that's probably completely true. The server is running on a virtual machine with only 256MB of RAM, so I'm not hugely surprised that it's a bit slow. On the other hand, it's been fast enough for us until now, and I'd rather not upgrade hardware unless there's no other choice.

> Also, you can add "debug_level = 5" or higher to your [domain/default] section and check whether something is causing the enumeration to return a failure.

Sounds like a good idea to check this. I'll get back to you on it.

jdieter commented 13 years ago

Last 50 lines of sssd_default.log
sssd_default.log

sgallagh commented 13 years ago

Replying to [comment:8 jdieter]:

> That's definitely true. I cleared the cache, restarted the computer and then did my timed ls immediately after the computer rebooted. So here's the next question. If I set {{{enumerate=true}}}, what happens when the cache expires? If I try to do an {{{ls -l /networld/usershare/users}}} while it's refreshing the cache in the background, will it use the expired cached data or will it block until the cache is updated?

Actually, the enumeration code is pretty smart. It will regularly refresh itself to make sure no changes are needed (it does this by performing periodic checks against LDAP for entries that have changed since the last periodic check) and every so many hours it will do a full enumeration (returning cached data in the meantime) to make sure it hasn't diverged with the periodic checks.

So once the enumeration is finished, you shouldn't see cache misses causing blocking.
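
The refresh pattern described above (frequent checks for changed entries, an occasional full pass, lookups served from cache throughout) can be modelled with a toy in-memory cache. This is a sketch of the general idea only, not SSSD's code: the class and method names, the integer `modified` stamp, and the users `jd002` and `jd003` are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    name: str
    modified: int  # hypothetical "last changed" stamp kept by the server

@dataclass
class EnumerationCache:
    entries: dict = field(default_factory=dict)
    last_check: int = 0

    def delta_refresh(self, server: list, now: int) -> int:
        """Periodic check: pull only entries changed since the last check."""
        changed = [e for e in server if e.modified > self.last_check]
        for e in changed:
            self.entries[e.name] = e
        self.last_check = now
        return len(changed)

    def full_refresh(self, server: list, now: int) -> None:
        """Full enumeration: rebuild the cache, dropping deleted entries."""
        self.entries = {e.name: e for e in server}
        self.last_check = now

    def lookup(self, name: str):
        """Lookups read the cache and never wait for a refresh."""
        return self.entries.get(name)

server = [Entry("jd001", 10), Entry("jd002", 10)]
cache = EnumerationCache()
cache.full_refresh(server, now=20)

server.append(Entry("jd003", 30))            # a new user appears
print(cache.delta_refresh(server, now=40))   # only the new entry is fetched

server.pop(1)                                # jd002 is deleted upstream
cache.delta_refresh(server, now=50)          # a delta check cannot see deletions
print(cache.lookup("jd002") is not None)     # stale entry still cached
cache.full_refresh(server, now=60)
print(cache.lookup("jd002") is None)         # full pass removes it
```

The deleted-user case shows why a periodic full pass is still useful in this model: change-only checks are cheap but can drift from the server, and the full enumeration corrects that drift.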

Also, the log you attached looks like you caught it in the middle of updating the cache with the results it got from LDAP. Might want to wait a little longer to see that it finishes completely.

> As for the LDAP server being slow, that's probably completely true. The server is running on a virtual machine with only 256MB of RAM, so I'm not hugely surprised that it's a bit slow. On the other hand, it's been fast enough for us until now, and I'd rather not upgrade hardware unless there's no other choice.

Sorry, I wasn't implying that you needed new hardware, necessarily. Just suggesting that there may be network/CPU latency issues involved here as well.

jdieter commented 13 years ago

Yeah, I was aware that it was enumerating as I did the log. The thing I found interesting is that it's only getting six names a second. How does nss_ldap beat this? Does it grab names in parallel?

Thanks for the info on the enumeration code, that was exactly what I wanted to hear. Last question: If we enumerate the LDAP database, storing it in the local database and then make an image of the system, will any computers that are imaged with that image a month or two down the road continue to use the (extremely old) cache while they update it?

Thanks much for your swift help on this. I really appreciate your quick replies.
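
For scale, the six-names-a-second rate mentioned above can be turned into an estimated enumeration time with a few lines of Python. This is back-of-the-envelope arithmetic, assuming the rate stays constant and that there are about 1200 entries to fetch (rounded from "almost 1200 users"); the variable names are illustrative.

```python
ASSUMED_USERS = 1200   # assumption: rounded from "almost 1200 users"
OBSERVED_RATE = 6      # names per second, as read from the log

enumeration_seconds = ASSUMED_USERS / OBSERVED_RATE
minutes, seconds = divmod(enumeration_seconds, 60)
print(f"full enumeration: about {int(minutes)}m{int(seconds)}s")

# Per-entry costs implied by measurements quoted elsewhere in this thread.
per_entry_enumerating = 1 / OBSERVED_RATE   # seconds per name while enumerating
single_lookup = 0.196                       # one home directory, empty cache
print(f"{per_entry_enumerating:.3f} s per name enumerating, "
      f"{single_lookup:.3f} s for a single lookup")
```

Under these assumptions a full enumeration needs a few minutes, which fits the earlier suggestion to wait about ten minutes after a restart before timing the listing again.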

sgallagh commented 13 years ago

Replying to [comment:10 jdieter]:

> Yeah, I was aware that it was enumerating as I did the log. The thing I found interesting is that it's only getting six names a second. How does nss_ldap beat this? Does it grab names in parallel?

Hmm, that's an interesting observation. I hadn't noticed that. We'll have to look into why that's bottlenecking there. It shouldn't be THAT slow.

> Thanks for the info on the enumeration code, that was exactly what I wanted to hear. Last question: If we enumerate the LDAP database, storing it in the local database and then make an image of the system, will any computers that are imaged with that image a month or two down the road continue to use the (extremely old) cache while they update it?

I don't THINK it will, because the individual entries should still be governed by the entry_cache_timeout, which will cause them to go and do an individual entry lookup while the enumeration is running. But I'll be honest, that might require some testing that I haven't done yet.

sgallagh commented 13 years ago

Fields changed

component: SSSD => LDAP Provider
milestone: NEEDS_TRIAGE => SSSD 1.4.0

dpal commented 13 years ago

Fields changed

owner: somebody => jhrozek

dpal commented 13 years ago

Fields changed

priority: major => blocker

sgallagh commented 13 years ago

Additional investigation here has identified a serious performance bottleneck when performing initgroups of users who are members of large groups. We are recursively pulling down a lot of data that we don't really need for this action, and it's resulting in 100% CPU situations and long wait times for results.

dpal commented 13 years ago

Fields changed

milestone: SSSD 1.5.0 => SSSD 1.4.0

sgallagh commented 13 years ago

The priority of this issue has been raised. It's becoming a serious issue on many deployments.

This is now a blocker for 1.2.2.

milestone: SSSD 1.4.0 => SSSD 1.2.2
owner: jhrozek => sgallagh
status: new => assigned

sgallagh commented 13 years ago

Fixed by b47a7c2

fixedin: => 1.2.2
resolution: => fixed
status: assigned => closed
tests: 0 => 1

jgalipea commented 12 years ago

Fields changed

coverity: =>
patch: => 0
tests: 1 => 0
upgrade: => 0

dpal commented 12 years ago

Fields changed

rhbz: => 0

Metadata Update from @jdieter:
- Issue assigned to sgallagh
- Issue set to the milestone: SSSD 1.2.2

7 years ago

pbrezina commented 3 years ago

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in SSSD's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/1520

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata

Assignee: sgallagh
Priority: blocker
Milestone: SSSD 1.2.2
type: defect
component: LDAP Provider
version: 1.1.1
cc: jhrozek
