#3468 SSSD doesn't use AD global catalog for gidnumber lookup, resulting in unacceptable delay for large forests
Closed: Fixed 4 years ago Opened 4 years ago by jhrozek.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 7): Bug 1421194

Description of problem:
Customer has a large number of users on RHEL7 SSSD clients that are complaining
about extreme slowness in name service lookups. Investigations indicate that
SSSD does not appear to be using the global catalog for gidnumber lookups.

SSSD is trawling through all the domains in the forest (and other trusted
domains), rather that doing a global catalog search.

This is causing serious performance problems on RHEL7 clients in the production
AD forest.

Please can you explain under what circumstances the SSSD client is supposed to
use the GC and confirm whether the behaviour seen in my testing is as expected
or not.  Specifically could you provide information on what attributes are
searched for in the GC, whether this is effective for forward (eg
user->uidnumber) and reverse (eg uidnumber->user) lookups, and what is the
expected behaviour if a value for the attribute is not found in the GC.

Version-Release number of selected component (if applicable):
python-sssdconfig-1.14.0-43.el7.noarch
sssd-common-1.14.0-43.el7.x86_64
sssd-ipa-1.14.0-43.el7.x86_64
sssd-common-pac-1.14.0-43.el7.x86_64
sssd-ldap-1.14.0-43.el7.x86_64
sssd-1.14.0-43.el7.x86_64
sssd-krb5-common-1.14.0-43.el7.x86_64
sssd-krb5-1.14.0-43.el7.x86_64
sssd-ad-1.14.0-43.el7.x86_64
sssd-client-1.14.0-43.el7.x86_64
sssd-proxy-1.14.0-43.el7.x86_64

How reproducible:
Reliably.

Steps to Reproduce:

Test Performed:

# Stop SSSD, clear cache and logs, and restart

[root@oo-vm5 ~]# service sssd stop
Redirecting to /bin/systemctl stop  sssd.service
[root@oo-vm5 ~]# rm -rf /var/lib/sss/db/* /var/log/sssd/*
[root@oo-vm5 ~]# service sssd start
Redirecting to /bin/systemctl start  sssd.service

# Do a ls of a directory containing files with group owners from local domain,
trusted domains, and not existing at all.

[jw83113@oo-vm5 test]$ ls -lg
total 0
-rw-rw-r--. 1 unixeng-access-grp@eurdev.nsrootdev.net 0 Feb  3 04:12 eurgroup
-rw-rw-r--. 1                                 9999999 0 Feb  3 04:12
groupnotfound
-rw-rw-r--. 1 jwlacgrp2@lacdev.nsrootdev.net          0 Feb  3 09:53 lacgroup
-rw-rw-r--. 1 jw83113                                 0 Feb  3 04:12 namgroup

# Look in sssd_nss.log file to see what domains are being queried (log level
0x01F0)
#  It can be observed that SSSD is trawling domains until it finds the
gidnumber or worse when it does not exist, it tries every domain rather than
checking the GC.  (SSSD should know that gidnumber is replicated to GC from the
schema objects)

[root@oo-vm5 ~]# tail -f  /var/log/sssd/sssd_nss.log
[?]

(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1200955690@NAMDEV.NSROOTDEV.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1200955690@NAMDEV.NSROOTDEV.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@NAMDEV.NSROOTDEV.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@NAMDEV.NSROOTDEV.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@eurdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@eurdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@lacdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@lacdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@ausdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [9999999@ausdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0080): No
matching domain found for [9999999]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1250152112@NAMDEV.NSROOTDEV.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1250152112@NAMDEV.NSROOTDEV.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1250152112@nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1250152112@nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1250152112@eurdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1250152112@eurdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1290120733@NAMDEV.NSROOTDEV.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1290120733@NAMDEV.NSROOTDEV.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1290120733@nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1290120733@nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1290120733@eurdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1290120733@eurdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1290120733@lacdev.nsrootdev.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [1290120733@lacdev.nsrootdev.net]

# A TCP dump shows that SSSD does communicate with the GC but then goes to each
domain anyway.

13:14:55.561889 IP oo-vm5.nam.nsroot.net.56466 >
namdev003.nam.nsroot.net.msft-gc: Flags [P.], seq 2020:2222, ack 3596, win 321,
options [nop,nop,TS val 2718168430 ecr 113087555], length 202
13:14:55.580684 IP namdev003.nam.nsroot.net.msft-gc >
oo-vm5.nam.nsroot.net.56466: Flags [P.], seq 3596:3868, ack 2222, win 258,
options [nop,nop,TS val 113087586 ecr 2718168430], length 272
13:14:55.580698 IP oo-vm5.nam.nsroot.net.56466 >
namdev003.nam.nsroot.net.msft-gc: Flags [.], ack 3868, win 344, options
[nop,nop,TS val 2718168448 ecr 113087586], length 0
13:14:55.590267 IP oo-vm5.nam.nsroot.net.32812 > namdev003.nam.nsroot.net.ldap:
Flags [P.], seq 5116:5416, ack 31133, win 836, options [nop,nop,TS val
2718168458 ecr 113087583], length 300
13:14:55.590811 IP oo-vm5.nam.nsroot.net.32812 > namdev003.nam.nsroot.net.ldap:
Flags [P.], seq 5416:5716, ack 31219, win 836, options [nop,nop,TS val
2718168459 ecr 113087587], length 300
13:14:55.591781 IP oo-vm5.nam.nsroot.net.35076 >
nsrootdev001.nam.nsroot.net.ldap: Flags [P.], seq 3455:3745, ack 9334, win 406,
options [nop,nop,TS val 2718168460 ecr 247010140], length 290
13:14:55.592662 IP oo-vm5.nam.nsroot.net.35076 >
nsrootdev001.nam.nsroot.net.ldap: Flags [P.], seq 3745:4035, ack 9970, win 416,
options [nop,nop,TS val 2718168460 ecr 247010144], length 290
13:14:55.593851 IP oo-vm5.nam.nsroot.net.46548 > eurdev003.nam.nsroot.net.ldap:
Flags [P.], seq 2686:2986, ack 4045, win 321, options [nop,nop,TS val
2718168462 ecr 246862079], length 300
13:14:55.594750 IP oo-vm5.nam.nsroot.net.46770 > ausdev001.nam.nsroot.net.ldap:
Flags [.], ack 3466, win 299, options [nop,nop,TS val 2718168463 ecr
272910065], length 0
13:14:55.595372 IP oo-vm5.nam.nsroot.net.46548 > eurdev003.nam.nsroot.net.ldap:
Flags [P.], seq 2986:3286, ack 4131, win 321, options [nop,nop,TS val
2718168463 ecr 246862082], length 300
13:14:55.596562 IP oo-vm5.nam.nsroot.net.46992 > lacdev001.nam.nsroot.net.ldap:
Flags [P.], seq 2386:2686, ack 3466, win 299, options [nop,nop,TS val
2718168464 ecr 272909609], length 300
13:14:55.598262 IP oo-vm5.nam.nsroot.net.46992 > lacdev001.nam.nsroot.net.ldap:
Flags [.], ack 3948, win 321, options [nop,nop,TS val 2718168466 ecr
272909622], length 0
13:14:55.630765 IP oo-vm5.nam.nsroot.net.32812 > namdev003.nam.nsroot.net.ldap:
Flags [.], ack 31305, win 836, options [nop,nop,TS val 2718168499 ecr
113087587], length 0
13:14:55.632766 IP oo-vm5.nam.nsroot.net.35076 >
nsrootdev001.nam.nsroot.net.ldap: Flags [.], ack 10606, win 426, options
[nop,nop,TS val 2718168501 ecr 247010144], length 0
13:14:55.635786 IP oo-vm5.nam.nsroot.net.46548 > eurdev003.nam.nsroot.net.ldap:
Flags [.], ack 4217, win 321, options [nop,nop,TS val 2718168504 ecr
246862082], length 0

In the production AD forest this takes 18 seconds to search for a  single
non-existent group.  Once this goes into the negative entry cache (which I have
increased the time to live), the lookup is fast because it?s fulfilled from the
local cache but this initial lookup is unacceptably slow.  SSSD should do one
single search against the GC not trawl around all the domains.

(Mon Feb  6 13:20:24 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@NAM.NSROOT.NET]
(Mon Feb  6 13:20:24 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@NAM.NSROOT.NET]
(Mon Feb  6 13:20:24 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@nsroot.net]
(Mon Feb  6 13:20:25 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@nsroot.net]
(Mon Feb  6 13:20:25 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@cardsprod.cards.citicorp.com]
(Mon Feb  6 13:20:27 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@cardsprod.cards.citicorp.com]
(Mon Feb  6 13:20:27 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@eur.nsroot.net]
(Mon Feb  6 13:20:27 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@eur.nsroot.net]
(Mon Feb  6 13:20:27 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@cbna.csdev.narb.citicorp.com]
(Mon Feb  6 13:20:29 2017) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040):
Unable to get information from Data Provider
Error: 3, 1432158270, Group lookup failed
Will try to return what we have in cache
(Mon Feb  6 13:20:29 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@jpn.nsroot.net]
(Mon Feb  6 13:20:31 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@jpn.nsroot.net]
(Mon Feb  6 13:20:31 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@lac.nsroot.net]
(Mon Feb  6 13:20:33 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@lac.nsroot.net]
(Mon Feb  6 13:20:33 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@REL.crest.dyn.nsroot.net]
(Mon Feb  6 13:20:35 2017) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040):
Unable to get information from Data Provider
Error: 3, 1432158270, Group lookup failed
Will try to return what we have in cache
(Mon Feb  6 13:20:35 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@bancocuscatlan.ubc]
(Mon Feb  6 13:20:38 2017) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040):
Unable to get information from Data Provider
Error: 3, 1432158270, Group lookup failed
Will try to return what we have in cache
(Mon Feb  6 13:20:38 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@aus.nsroot.net]
(Mon Feb  6 13:20:41 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@aus.nsroot.net]
(Mon Feb  6 13:20:41 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@nbs.csdev.narb.citicorp.com]
(Mon Feb  6 13:20:42 2017) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040):
Unable to get information from Data Provider
Error: 3, 1432158270, Group lookup failed
Will try to return what we have in cache
(Mon Feb  6 13:20:42 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@apac.nsroot.net]
(Mon Feb  6 13:20:45 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100):
Requesting info for [500008636@apac.nsroot.net]
(Mon Feb  6 13:20:45 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0080): No
matching domain found for [500008636]


Actual results:
SSSD goes to global catalog, but then trawls subdomains any way.

Expected results:
SSSD should always use global catalog first, and only trawl subdomains if data
cannot be found.

Additional info:
The performance problem is occuring in a production AD environment which is
very large (approx 400000 users), however the behaviour has been replicated in
a development AD environment too.  The delays in development are much smaller
because there are a lot fewer entries however the effects are the same in that
many domains are being searched individually rather than a single GC query.

When does the behavior occur? Frequently?  Repeatedly?   At certain times?

Always when an entry is not in the local SSSD cache.  It's repeatable.

What information can you provide around timeframes and the business impact?

This is an important case for the customer as it has visibility to production
end users.  They are managing the impact by increasing the SSSD negative entry
cache time to live to reduce the frequency of these lookups but it's not a root
cause fix.  They need a root cause fix for this problem.

Metadata Update from @jhrozek:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1421194

4 years ago

Metadata Update from @jhrozek:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1421194

4 years ago

Metadata Update from @jhrozek:
- Issue priority set to: major

4 years ago

Metadata Update from @jhrozek:
- Issue tagged with: performance

4 years ago

Metadata Update from @jhrozek:
- Issue assigned to jhrozek

4 years ago

Since we are required to release a new upstream tarball no later than Friday Oct-20, I'm moving tickets that will not be closed by that date to the next milestone, 1.16.1

Metadata Update from @jhrozek:
- Issue set to the milestone: SSSD 1.16.1 (was: SSSD 1.16.0)

4 years ago

Since we are required to release a new upstream tarball no later than Friday Oct-20, I'm moving tickets that will not be closed by that date to the next milestone, 1.16.1

Metadata Update from @jhrozek:
- Issue tagged with: PR

4 years ago

Metadata Update from @lslebodn:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/4494

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Login to comment on this ticket.

Metadata