#3362 fiter_users and filter_groups stop working properly in v 1.15

Created 7 months ago by akomlik
Modified 4 months ago

After upgrade from 1.14 to 1.15 I noticed odd behavior of filter_* settings
If local user or group absent this works fine and prevents accounts that exist in LDAP from showing up in the host
Example of config:

# grep named /etc/sssd/sssd.conf 
filter_users = avahi,dbus,haldaemon,ldap,named,news,nscd,radiusd,root
filter_groups = named,stapdev,stapusr,uuidd

# grep sss /etc/nsswitch.conf
passwd:     files sss
shadow:     files sss
group:      files sss

Now without local named account the user does not exist as expected:

# grep named /etc/passwd /etc/group 
# id named
id: named: no such user

However, both user and group named still show up as enumerated "leaking:" from LDAP

# getent passwd named
# getent passwd | grep named:
named:*:25:25:Service Account named:/var/named:/bin/bash
# getent group  named
# getent group  | grep named
named:*:25:

Now, trying to add local user with running SSSD causes error and weird behavior. the user appears to be created but yet not found in local files.

# useradd -u 25 named
useradd: UID 25 is not unique
# getent passwd 25
named:*:25:25:Service Account named:/var/named:/bin/bash
# getent passwd named
named:*:25:25:Service Account named:/var/named:/bin/bash
# grep named /etc/passwd
# id named
uid=25(named) gid=25(named) groups=25(named)
# grep named /etc/passwd  /etc/group
#

After sssd restart it disappears

# service sssd restart
Redirecting to /bin/systemctl restart  sssd.service
# id named
id: named: no such user

After downgrade to sssd 1.14 all works as expected and user/group named never shows up from LDAP

sssd.conf

Attachments
sssd.conf - 2017-04-04 17:45:28 From Issue description Download
sssd.spec - 2017-04-04 19:52:28 Comment Download
sssd_nss.log.gz - 2017-05-17 20:02:47 Comment Download
sssd_nss.log.gz - 2017-05-18 10:16:30 Comment Download

What distribution do you use?

We use CentOS-7 here
I built RPM out of sssd-1.15 distribution using settings that match stock RPM spec file to make direct replacement

Does your specfile also enable the files domain? (or is there one more sssd_be process running with domain implicit_files) ?

No, only one sssd_be process

sssd.spec

However, both user and group named still show up as enumerated "leaking:" from LDAP

How did you found out they are leaked from LDAP?
Because of different home directory?

I would recommend to
increase debug_level (8 or 9) in "[nss]" section of sssd.conf
remove sssd cache rm -f /var/lib/sss/db/*
restart sssd
call getent passwd -s named
* check log files whether there is something about filtering users.

Now, trying to add local user with running SSSD causes error and weird behavior. the user appears to be created but yet not found in local files.

You didn't add local user with useradd -u 25 named because UID is not uniq. and therefore you cannot find it in files.

BTW It would be good to know details about version of sssd.
sssd-1.15 is not exact. Could you provide git hash?

Hmm, I can reproduce it with git master.
I assume it is caused by cache_req refactoring.

Thank you very much for bug report.

Glad you can reproduce it guys!
Yes, could tell local vs LDAP entry by GECOS field, home, etc.

I first stumble on this when our admins complained to me that on hosts they installed bind RPM after system reboot bind would not start complaining of missing user "named"
So I started digging into this and discovered this bug.
Now I understand what was happening then:
- bind RPM pre-install script would try to create local user named and that would result in error logged into /var/log/secure however user "appeared" to have been created just like in my example so bind would start right after that and run till system rebooted. But in reality no local file entry was created
- after system reboot ( = SSSD restart) since no local user was present and named was (semi) filtered from LDAP bind would not start.

I was pulling my hair for trying to figure this one out and then tried for the heck of it downgrade back to 1.14 and then all was good again. Upgraded to 1.15 - problem
I even built using latest sssd-1_15_2 git tag RPM yesterday only to discover the issue was still there

$ git log -1
commit 4c9419d98b89a6161a3dde11f9f80be39d12e72a
Author: Jakub Hrozek <jhrozek@redhat.com>
Date:   Wed Mar 15 17:53:24 2017 +0100

 Updating the translations for the 1.15.2 release
Edited 7 months ago by akomlik

Actually I cannot reproduce with git master.
I had a typo in sssd.conf filter_user vs filter_users

Oh! Not good.

Some additional points. In presence of local account both local and LDAP entries show up

[root@m0072487 akomlik]# service sssd start
[root@m0072487 akomlik]# getent group named
[root@m0072487 akomlik]# id named
id: named: no such user
[root@m0072487 akomlik]# grep named /etc/passwd /etc/group

[root@m0072487 akomlik]# service sssd stop
Redirecting to /bin/systemctl stop  sssd.service

[root@m0072487 akomlik]# yum install -y -q bind
[root@m0072487 akomlik]# grep named /etc/passwd /etc/group
/etc/passwd:named:x:25:25:Named:/var/named:/sbin/nologin
/etc/group:named:x:25:

[root@m0072487 akomlik]# rm -fr /var/lib/sss/db/* /var/log/sssd/*
[root@m0072487 akomlik]# service sssd start
Redirecting to /bin/systemctl start  sssd.service

[root@m0072487 akomlik]# getent group | grep named:
named:x:25:
named:*:25:
[root@m0072487 akomlik]# getent passwd | grep named:
named:x:25:25:Named:/var/named:/sbin/nologin
named:*:25:25:Service Account named:/var/named:/bin/bash
[root@m0072487 akomlik]#

I assume that the 1st getent group named didn't return anything because enumeration has not finished yet.

Either try to wait few seconds or test with disabled enumeration.

And try to follow document
https://fedorahosted.org/sssd/wiki/Troubleshooting.

Or provide reliable reproducer :-)

First "getent group named" did not return anything because of the filter_groups statement.

We're forced to use enumeration or else init_groups takes forever due dozen thousands of LDAP queries required to perform init_groups for a user who logs in (look at our sssd.conf sample attached with elaborate LDAP filter controlling host access). This leads to 30s first login delays on a LAN and >1min logins when LDAP server is connected over WAN link and is unacceptable.
With enumeration ON all groups are cached locally and logins are instantaneous.

I'm disabling enumeration and testing again:

  1. Starting point:
    - user/group named filtered out
    - enumerations turned off

    [root @m0072487 akomlik]# egrep 'named|^enumerate' /etc/sssd/sssd.conf
    filter_users = avahi,dbus,haldaemon,ldap,named,news,nscd,radiusd,root
    filter_groups = named,stapdev,stapusr,uuidd
    enumerate=FALSE

  2. Clean cache and restart

    [root @m0072487 akomlik]# service sssd stop; rm -fr /var/lib/sss/db/ /var/log/sssd/ ; service sssd start
    Redirecting to /bin/systemctl stop sssd.service
    Redirecting to /bin/systemctl start sssd.service

  3. Checking for group named returns nothing as expected

    [root @m0072487 akomlik]# getent -s sss group named

  4. Checking for GID 25 however returns LDAP entry

    [root @m0072487 akomlik]# getent -s sss group 25
    named:*:25:

  5. After this check bny name also returns entry

    [root @m0072487 akomlik]# getent -s sss group named
    named:*:25:

  6. Trying to create local group fails now despite the expectations.

    [root @m0072487 akomlik]# groupadd -g 25 named
    groupadd: group 'named' already exists

Lukas, would you like me to conduct online screen-sharing session so we can get to the bottom of this faster?

As I mentioned in original description, once I downgrade to 1.14 things are back to normal:
- no group is return either by name named or by GID 25 from LDAP
- I can create group and user no problem

[root@m0072487 SPECS]# yum -q -y downgrade -q `rpm -qa --queryformat "%{NAME}-1.14.0-43.el7 " "*sss*" "libipa*"`
[root@m0072487 SPECS]#  service sssd stop; rm -fr /var/lib/sss/db/* /var/log/sssd/* ; service sssd start
Redirecting to /bin/systemctl stop  sssd.service
Redirecting to /bin/systemctl start  sssd.service
[root@m0072487 SPECS]# getent group named
[root@m0072487 SPECS]# getent group 25
[root@m0072487 SPECS]# groupadd -g 25 named
[root@m0072487 SPECS]# useradd -u 25 -g named named
[root@m0072487 SPECS]#

Thank you very much. I missed the step with query for GID.

getent -s sss group 25 # sssd 1.14 did not return anything.
getent -s sss group named

We can reproduce it now.

Git bisect says that 1st broken commit is
4049b63 "nss: rewrite nss responder so it uses cache_req".

I expected that.

And here is the explanation why it worked in 1.14. We checked negative cache before the request but also after we got results from back-end

(Wed Apr  5 17:46:52 2017) [sssd[nss]] [nss_cmd_getbyid] (0x0400): Running command [18][SSS_NSS_GETPWUID] with id [1002].
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sss_ncache_check_str] (0x2000): Checking negative cache for [NCE/UID/1002]

(Wed Apr  5 17:46:52 2017) [sssd[nss]] [nss_cmd_getpwuid_search] (0x0100): Requesting info for [1002@LDAP]
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [get_dp_name_and_id] (0x0400): Not a LOCAL view, continuing with provided values.
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request for [0x42960a:1:1002@LDAP]
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sss_dp_get_account_msg] (0x0400): Creating request for [LDAP][0x1][BE_REQ_USER][1][idnumber=1002:-]
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sbus_add_timeout] (0x2000): 0x1156de0
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering request [0x42960a:1:1002@LDAP]
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sbus_remove_timeout] (0x2000): 0x1156de0
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 0x1155b10
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sss_dp_get_reply] (0x1000): Got reply from Data Provider - DP error code: 0 errno: 0 error message: Success
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [nss_cmd_getpwuid_search] (0x0100): Requesting info for [1002@LDAP]
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [nss_cmd_getpwuid_search] (0x0400): Returning info for uid [1002@LDAP]
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sss_ncache_check_str] (0x2000): Checking negative cache for [NCE/USER/LDAP/user2@ldap]

(Wed Apr  5 17:46:52 2017) [sssd[nss]] [fill_pwent] (0x0400): User [user2@ldap] filtered out! (negative cache)
(Wed Apr  5 17:46:52 2017) [sssd[nss]] [sss_dp_req_destructor] (0x0400): Deleting request: [0x42960a:1:1002@LDAP]
7 months ago

Metadata Update from @jhrozek:
- Issue set to the milestone: SSSD 1.15.3

7 months ago

Metadata Update from @jhrozek:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1440132

7 months ago

Metadata Update from @jhrozek:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1440132

Issue linked to Bugzilla: Bug 1440132

7 months ago

Metadata Update from @fidencio:
- Issue assigned to fidencio

7 months ago

Metadata Update from @fidencio:
- Custom field patch adjusted to on

7 months ago

Metadata Update from @jhrozek:
- Issue priority set to: critical

6 months ago

Metadata Update from @jhrozek:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

So, I built sssd RPMs using latest snapshot and still see bizzare issues with user/group "leaking" from LDAP despite using filters

[root@m0072487 ~]# rpm -q sssd-common
sssd-common-1.15.3-0.20170516.0926.git6a1da82.el7_3.x86_64

[root@m0072487 ~]# egrep 'named|enumerate'  /etc/sssd/sssd.conf 
filter_users = avahi,dbus,haldaemon,ldap,named,news,nscd,radiusd,root
filter_groups = named,stapdev,stapusr,uuidd
enumerate = True

performing clean start:

[root@m0072487 ~]# service sssd stop ; rm -f /var/lib/sss/db/* /var/log/sssd/* ; service sssd start
Redirecting to /bin/systemctl stop  sssd.service
Redirecting to /bin/systemctl start  sssd.service

At first, things apper to be working as expected. No local or LDAP user/group named exists on a system:

[root@m0072487 ~]# getent passwd named
[root@m0072487 ~]# getent group named

However, trying now to add local group named I get an error due to existing GID conflict:

[root@m0072487 ~]# groupadd -g 25 named
groupadd: GID '25' already exists
[root@m0072487 ~]# getent  group named
named:*:25:
[root@m0072487 ~]#

After waiting for few minutes this phantom group disappears and things look normal:

[root@m0072487 ~]# getent group named
[root@m0072487 ~]#

However, as soon as I try to create local group same error comes up again:

[root@m0072487 ~]# groupadd -g 25 named
groupadd: GID '25' already exists
[root@m0072487 ~]# getent group named
named:*:25:
[root@m0072487 ~]#

Next, I disable enumeration and test again. Now I can create local group but then it fails on user creation for same reason. Cache cleanup and restart makes it look good.

[root@m0072487 ~]# egrep 'named|enumerate'  /etc/sssd/sssd.conf 
filter_users = avahi,dbus,haldaemon,ldap,named,news,nscd,radiusd,root
filter_groups = named,stapdev,stapusr,uuidd
enumerate = FALSE
[root@m0072487 ~]# service sssd stop ; rm -f /var/lib/sss/db/* /var/log/sssd/* ; service sssd start
Redirecting to /bin/systemctl stop  sssd.service
Redirecting to /bin/systemctl start  sssd.service
[root@m0072487 ~]# getent group named
[root@m0072487 ~]# getent passwd named
[root@m0072487 ~]# getent passwd 25
[root@m0072487 ~]# getent group 25
[root@m0072487 ~]# groupadd -g 25 named
[root@m0072487 ~]# useradd -u 25 -g named named
useradd: UID 25 is not unique
[root@m0072487 ~]# getent passwd named
named:*:25:25:Service Account named:/var/named:/bin/bash
[root@m0072487 ~]# service sssd stop ; rm -f /var/lib/sss/db/* /var/log/sssd/* ; service sssd start
Redirecting to /bin/systemctl stop  sssd.service
Redirecting to /bin/systemctl start  sssd.service
[root@m0072487 ~]# getent passwd named

I'm reopening the ticket as I'm able to reproduce the issue.

6 months ago

Metadata Update from @fidencio:
- Issue status updated to: Open (was: Closed)

@akomlik, would you be willing to try a testing patch from me?

I'm kinda able to reproduce the issue but, unfortunately, not in a reliable way.
I do have a blind fix that may help your case ...

@akomlik: Forget about testing. I've just learned that SSSD team do not prefer asking people to test our stuff.

Can you provide a 100% reliable reproducer?

Edited 6 months ago by lslebodn

@akomlik: Forget about testing. I've just learned that SSSD team do not prefer asking people to test our stuff.

FWIW, I think it's a good idea to ask reporters to test patches, as long as the NVR of the test package is sane so that they can later upgrade to a later version.

@akomlik,

I can reliable reproduce the issue when enumeration is enabled. However, I could reproduce only once when enumeration is not enabled.

May I ask you to also provide sssd_nss.log? (debug_level = 8 is the log level that I need).

Sure, I'll run the test again with log debug level =8 and attach it here.

I also noticed that with enumeration the issue is more pronounced vs without.
We have to use enumeration or else resolving complex LDAP filter leads to huge amount of traffic to LDAP server so that simple login could take a minute even on fast network. Enumeration allows prepopulate all users and groups into local SSSD cache so that when admins log in to a host first time it happens quickly.

Super! I've noticed that we haven't been filtering out users when those are found directly in the cache (and that's the code path taken when using enumeration). So, for this one I do have a fix.

The case without enumeration still bugs me and the logs are going to help a lot.

Also, I do believe I've found a simpler reproducer for the enumeration case:

[root@client1 vagrant]# egrep 'user00|enumerate' /etc/sssd/sssd.conf 
enumerate = True
filter_users = user00, user01
filter_groups = user00, user01
[root@client1 vagrant]# rm -rf /var/log/sssd/sssd* /var/lib/sss/db/* ; systemctl restart sssd                                           
[root@client1 vagrant]# id 1790400001
uid=1790400001(user00) gid=1790400001(user00) groups=1790400001(user00)

thanks. I just ran my test with enumeration off and it failed in same way:

[root@m0072487 /]# egrep 'named|enumerate'  /etc/sssd/sssd.conf 
filter_users = avahi,dbus,haldaemon,ldap,named,news,nscd,radiusd,root
filter_groups = named,stapdev,stapusr,uuidd
enumerate = FALSE
[root@m0072487 /]# service sssd stop ; rm -f /var/lib/sss/db/* /var/log/sssd/* ; service sssd start
Redirecting to /bin/systemctl stop  sssd.service
Redirecting to /bin/systemctl start  sssd.service
root@m0072487 /]# getent passwd 25
[root@m0072487 /]# getent group named
[root@m0072487 /]# getent group 25
[root@m0072487 /]# !groupad
groupadd -g 25 named
groupadd: GID '25' already exists
    [root@m0072487 /]# id 25
    uid=25(named) gid=25(named) groups=25(named)
   [root@m0072487 /]#

Attaching the nss log
sssd_nss.log.gz

Edited 6 months ago by akomlik

Okay. I can reproduce it here following these steps.

My understanding of the problem is that it may happen in this case (where enumeration is disabled) because the first few getent calls done will end up adding the group to the cache and when we search by the group during the groupadd (or even a simple id 25 would trigger the issue) we don't filter it out when the result is coming from the cache (and here you can see the same issue with the enumeration).

I'll write a test and submit a PR for this issue.

If you want to give it a try to a patch, feel free: https://github.com/fidencio/sssd/tree/wip/3362

If you decide to test the patch (and I really would appreciate that), keep in mind that it's something not recommended/desired by, at least, part of the SSSD team.

Thanks. I'll give that patch a try first thing tomorrow

FWIW, I think it's a good idea to ask reporters to test patches, as long as the NVR of the test package is sane so that they can later upgrade to a later version.

It does not scale. Reported will not test the same use-case after each commit in upstream. And without reliable reproducer we cannot write a test and prevent regressions in future.

Anyway, @akomlik thank you very much for bug report and time which you spent with it. I am so sorry for regressions which were introduced in 1.15. It happens with refactoring. We caught many crashes and regressions with existing tests. But it looks like filtering users/groups had very low test coverage. We need to improve it in upstream.

FWIW, I think it's a good idea to ask reporters to test patches, as long as the NVR of the test package is sane so that they can later upgrade to a later version.

It does not scale. Reporters will not test the same use-case after each commit in upstream. And without reliable reproducer we cannot write a test and prevent regressions in future.

Anyway, @akomlik thank you very much for bug report and time which you spent with it. I am so sorry for regressions which were introduced in 1.15. It happens with refactoring. We caught many crashes and regressions with existing tests. But it looks like filtering users/groups had very low test coverage. We need to improve it in upstream.

Edited 6 months ago by lslebodn

Greetings!
I build new RPMs using the patch and still see the problem

[root@m0072487 /]# d 25
uid=25(named) gid=25 groups=25
[root@m0072487 /]# id named
uid=25(named) gid=25 groups=25
[root@m0072487 /]#

sssd_nss.log.gz

Amazing! Thanks a lot for the testing and it shows me there's something else weird happening.

5 months ago

Metadata Update from @lslebodn:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

@akomlik
If you want we can prepare test build for you.

@akomlik
If you want we can prepare test build for you.

yes, please make it.

great, I just tried it on my server and things are back to normal.

When can we expect next (1.15.3) release with this fix included?

When can we expect next (1.15.3) release with this fix included?

I would expect a week or two.

Edited 5 months ago by lslebodn

I still don't see 1.15.3 release

Login to comment on this ticket.

on

https://bugzilla.redhat.com/show_bug.cgi?id=1440132

cancel