#3593 supplemental groups disappear (regression?)
Closed: worksforme 6 years ago Opened 6 years ago by robinpblanchard.

We have discovered a seeming regression in the latest SSSD. Upon initial start of sssd, having first fully removed cache files (rm -rf /var/lib/sss/{mc,db}/*), users' groups are fully enumerated (via id/groups). After some time (measured in minutes less than one hour) passes, users groups are suddenly no longer fully enumerated, instead returning only the primary group.

At this point of failure, getent group somegroup for a known group of a user still shows his membership (yet it is not realized in his/her environment). Restarting SSSD (with a complete removal of cache files) allows group memberships again to function (for some time measured in minutes). Note:sss_cache -E does NOT suffice to clear the condition.

SSS's nss logs at time of failure to refresh user's group memberships:

(Fri Dec 1 15:48:00 2017) [sssd[nss]] [nss_getby_name] (0x0400): Input name: MYUSER
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_send] (0x0400): CR #428: New request 'Initgroups by name'
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_process_input] (0x0400): CR #428: Parsing input name [MYUSER]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [sss_parse_name_for_domains] (0x0200): name 'MYUSER' matched without domain, user is MYUSER
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_set_name] (0x0400): CR #428: Setting name [MYUSER]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_select_domains] (0x0400): CR #428: Performing a multi-domain search
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_domains] (0x0400): CR #428: Search will check the cache and check the data provider
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_set_domain] (0x0400): CR #428: Using domain [MY.DOMAIN]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_prepare_domain_data] (0x0400): CR #428: Preparing input data for domain [MY.DOMAIN] rules
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_send] (0x0400): CR #428: Looking up MYUSER@MY.DOMAIN
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_ncache] (0x0400): CR #428: Checking negative cache for [MYUSER@MY.DOMAIN]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_ncache] (0x0400): CR #428: [MYUSER@MY.DOMAIN] is not present in negative cache
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #428: Looking up [MYUSER@MY.DOMAIN] in cache
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_send] (0x0400): CR #428: Object found, but needs to be refreshed.
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_dp] (0x0400): CR #428: Looking up [MYUSER@MY.DOMAIN] in data provider
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request for [0x562df3167b10:3:MYUSER@MY.DOMAIN@MY.DOMAIN]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [sss_dp_get_account_msg] (0x0400): Creating request for [MY.DOMAIN][0x3][BE_REQ_INITGROUPS][name=MYUSER@MY.DOMAIN:-]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering request [0x562df3167b10:3:MYUSER@MY.DOMAIN@MY.DOMAIN]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [nss_memorycache_update_initgroups] (0x1000): Updating inigroups memory cache of [MYUSER@MY.DOMAIN@MY.DOMAIN]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [sss_dp_get_reply] (0x1000): Got reply from Data Provider - DP error code: 0 errno: 0 error message: Success
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #428: Looking up [MYUSER@MY.DOMAIN] in cache
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_ncache_filter] (0x0400): CR #428: This request type does not support filtering result by negative cache
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_search_done] (0x0400): CR #428: Returning updated object [MYUSER@MY.DOMAIN]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_create_and_add_result] (0x0400): CR #428: Found 94 entries in domain MY.DOMAIN
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [cache_req_done] (0x0400): CR #428: Finished: Success
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [nss_protocol_fill_initgr] (0x0020): Incomplete group object [name=Users@MY.DOMAIN,cn=groups,cn=MY.DOMAIN,cn=sysdb] for initgroups! Aborting.
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [sss_dp_req_destructor] (0x0400): Deleting request: [0x562df3167b10:3:MYUSER@MY.DOMAIN@MY.DOMAIN]
(Fri Dec 1 15:48:00 2017) [sssd[nss]] [client_recv] (0x0200): Client disconnected!

This error is NOT manifested at clean start of SSSD -- only once it has entered the described "failed" state.

environment

rpm -qa |grep sssd

python-sssdconfig-1.15.2-50.el7_4.6.noarch
sssd-krb5-common-1.15.2-50.el7_4.6.x86_64
sssd-ad-1.15.2-50.el7_4.6.x86_64
sssd-ipa-1.15.2-50.el7_4.6.x86_64
sssd-proxy-1.15.2-50.el7_4.6.x86_64
sssd-common-pac-1.15.2-50.el7_4.6.x86_64
sssd-krb5-1.15.2-50.el7_4.6.x86_64
sssd-ldap-1.15.2-50.el7_4.6.x86_64
sssd-common-1.15.2-50.el7_4.6.x86_64
sssd-1.15.2-50.el7_4.6.x86_64
sssd-client-1.15.2-50.el7_4.6.x86_64

uname -r

4.14.0-1.el7.elrepo.x86_64

cat /etc/redhat-release

CentOS Linux release 7.4.1708 (Core)

relevant AD domain snippet from sssd.conf:

use_fully_qualified_names = false
full_name_format = %1$s
case_sensitive = false

id_provider = ad
auth_provider = ad
access_provider = ad
chpass_provider = ad

ldap_id_mapping = true
ldap_schema = ad

ad_hostname = my.f.q.d.n
ad_domain = my.domain
krb5_realm = MY.DOMAIN
enumerate = true
cache_credentials = true
krb5_store_password_if_offline = true


I wrote a quick and dirty script to attempt to figure out the "time" after which the failure condition occurs. It appears to be 15 minutes:

groups: 91
at Mon Dec 4 11:38:28 AST 2017
groups: 2
at Mon Dec 4 11:53:33 AST 2017

With 15 minutes in mind, I returned to the logs (debug level 7), and see

(Mon Dec 4 11:38:13 2017) [sssd[be[my.domain.com]]] [sdap_get_tgt_recv] (0x0400): Child responded: 0 [FILE:/var/lib/sss/db/ccache_MY.DOMAIN.COM], expired on [1512437893]
(Mon Dec 4 11:38:13 2017) [sssd[be[my.domain.com]]] [sdap_cli_auth_step] (0x0100): expire timeout is 900
(Mon Dec 4 11:38:13 2017) [sssd[be[my.domain.com]]] [sdap_cli_auth_step] (0x1000): the connection will expire at 1512402793

so it seems as if the cache (upon expiry) can't be properly updated? Yet it can be properly populated on clean start?

The expire timeout in the log messages of sdap_cli_auth_step() is the lifetime of the Kerberos ticket and the connection to the LDAP server, this is not related to cached data.

The 'Incomplete group object' error message indicates that a group the user is a member of has a missing GID or GID=0 and the group is not properly flagged as non-POSIX group in the cache. Do you know what kind of group Users@MY.DOMAIN is in AD? The backend is the only process which should update the GID, so please check sssd_my.domain.log for any activity for this group. Feel free to attach the (sanitized) log to this ticket so that I can have a look as well.

so...it's not specifically 15 minutes:

groups: 91
at Mon Dec 4 15:01:31 AST 2017
groups: 2
at Mon Dec 4 15:06:31 AST 2017

but it certainly smells like multiples of 300 seconds (ldap_enumeration_refresh_timeout ?)...I continue to peruse logs.

edit:
This 300 second (5 min) failure was when explicitly setting ldap_user_search_base. Presumably, then, the 15 minute (900) second version is perhaps due to reconnection_retries = 3.

log analysis continues

during initial/success -> failure

egrep 'nss_protocol_fill_initgr|enumeration'

(Mon Dec 4 16:49:14 2017) [sssd[be[my.domain.com]]] [dp_get_options] (0x0400): Option ldap_enumeration_refresh_timeout has value 300
(Mon Dec 4 16:49:14 2017) [sssd[be[my.domain.com]]] [dp_get_options] (0x0400): Option ldap_enumeration_search_timeout has value 60
(Mon Dec 4 16:49:14 2017) [sssd[be[my.domain.com]]] [sdap_id_setup_tasks] (0x0400): Setting up enumeration for my.domain.com
(Mon Dec 4 16:49:14 2017) [sssd[be[my.domain.com]]] [be_ptask_create] (0x0400): Periodic task [enumeration] was created
(Mon Dec 4 16:49:14 2017) [sssd[be[my.domain.com]]] [be_ptask_schedule] (0x0400): Task [enumeration]: scheduling task 0 seconds from now [1512420554]
(Mon Dec 4 16:49:14 2017) [sssd[be[my.domain.com]]] [be_ptask_execute] (0x0400): Task [enumeration]: executing task, timeout 300 seconds
(Mon Dec 4 16:49:23 2017) [sssd[be[my.domain.com]]] [be_ptask_done] (0x0400): Task [enumeration]: finished successfully
(Mon Dec 4 16:49:23 2017) [sssd[be[my.domain.com]]] [be_ptask_schedule] (0x0400): Task [enumeration]: scheduling task 300 seconds from last execution time [1512420854]
(Mon Dec 4 16:54:14 2017) [sssd[be[my.domain.com]]] [be_ptask_execute] (0x0400): Task [enumeration]: executing task, timeout 300 seconds
(Mon Dec 4 16:54:14 2017) [sssd[be[my.domain.com]]] [be_ptask_done] (0x0400): Task [enumeration]: finished successfully
(Mon Dec 4 16:54:14 2017) [sssd[be[my.domain.com]]] [be_ptask_schedule] (0x0400): Task [enumeration]: scheduling task 300 seconds from last execution time [1512421154]
(Mon Dec 4 16:59:14 2017) [sssd[be[my.domain.com]]] [be_ptask_execute] (0x0400): Task [enumeration]: executing task, timeout 300 seconds
(Mon Dec 4 16:59:14 2017) [sssd[be[my.domain.com]]] [be_ptask_done] (0x0400): Task [enumeration]: finished successfully
(Mon Dec 4 16:59:14 2017) [sssd[be[my.domain.com]]] [be_ptask_schedule] (0x0400): Task [enumeration]: scheduling task 300 seconds from last execution time [1512421454]
(Mon Dec 4 17:04:14 2017) [sssd[be[my.domain.com]]] [be_ptask_execute] (0x0400): Task [enumeration]: executing task, timeout 300 seconds
(Mon Dec 4 17:04:19 2017) [sssd[be[my.domain.com]]] [be_ptask_done] (0x0400): Task [enumeration]: finished successfully
(Mon Dec 4 17:04:19 2017) [sssd[be[my.domain.com]]] [be_ptask_schedule] (0x0400): Task [enumeration]: scheduling task 300 seconds from last execution time [1512421754]
(Mon Dec 4 17:04:29 2017) [sssd[be[my.domain.com]]] [be_ptask_destructor] (0x0400): Terminating periodic task [enumeration]

(Mon Dec 4 17:04:29 2017) [sssd[nss]] [nss_protocol_fill_initgr] (0x0020): Incomplete group object [name=Users@my.domain.com,cn=groups,cn=my.domain.com,cn=sysdb] for initgroups! Aborting.

./test.sh
groups: 92
at Mon Dec 4 16:49:25 AST 2017
groups: 2
at Mon Dec 4 17:04:29 AST 2017

!/usr/bin/env bash

set -e

systemctl stop sssd
rm -rf /var/lib/sss/{mc,db}/ /var/log/sssd/
systemctl restart sssd

groupcount=$(id MYADUSER|tr ',' '\n' |wc -l)
echo -e "groups: ${groupcount}\n\tat $(date)"

while true ; do
groupcount=$(id MYADUSER|tr ',' '\n' |wc -l)
if [[ "${groupcount}" -lt 10 ]] ; then
echo -e "groups: ${groupcount}\n\tat $(date)"
break
fi
sleep 30
done

systemctl stop sssd
sleep 5
rsync -avP /var/log/sssd /var/tmp/
systemctl restart sssd

I missed that you have 'enumerate = True'.

Can you check if it happens with 'enumerate = False' as well?

BTW if you use sssd with Active Directory then you might try to disable lookup in global catalog or replicate POSIX attributes to GC.

man sssd-ad -> ad_enable_gc

But I agree that sanitized log files + sssd configuration would be helpful.

BTW you might check following page https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html or even more on here https://pagure.io/docs/SSSD/sssd/

interesting. after ~3 years of successfully running SSSD in production with enumeration = true, now setting it to false seems to prevent the loss of groups described in initial ticket description. I continue to monitor testing and will confirm

On (05/12/17 15:34), Robin P Blanchard wrote:

interesting. after ~3 years of successfully running SSSD in production with enumeration = true, now setting it to false seems to prevent the loss of groups described in initial ticket description. I continue to monitor testing and will confirm

I would expect that something was changed on directory server after 3 years :-)

LS

any theories as to why it would enumerate successfully on clean (no cache present) start, then subsequently fail (and never self-resolve) ?

On (05/12/17 16:11), Robin P Blanchard wrote:

any theories as to why it would enumerate successfully on clean (no cache present) start, then subsequently fail ?

It is impossible to say without more data/logs. I had only a wild guess.

LS

sadly, enumerate = false ultimately leads to the same end-state: all supplemental groups disappear. will attempt to sanitize the mountain of logs....

We would really like to help but we can't.

Please either provide sanitized log files + steps which you did
https://docs.pagure.org/SSSD.sssd/users/reporting_bugs.html

Or try to follow troubleshooting page which might help you to find problem yourself.
https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html

I believe this may be closed. The final failed test was inappropriately run as an AD user (for whom the SSSD config changed whilst being logged in). enumerate = false has since returned all groups.

thank you for listening, your suggestions, and your patience.

I believe this may be closed. The final failed test was inappropriately run as an AD user (for whom the SSSD config changed whilst being logged in). enumerate = false has since returned all groups.
thank you for listening, your suggestions, and your patience.

I'm glad it works now.

Metadata Update from @lslebodn:
- Issue close_status updated to: worksforme
- Issue status updated to: Closed (was: Open)

6 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/4616

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Login to comment on this ticket.

Metadata