#49951 RFE improve deref plugin performance
Opened a year ago by tbordaz. Modified a year ago

Issue Description

If a search with a deref control hits large groups, it will trigger a large number of internal search to retrieve some specific values in the members.

If the response time is too high, client has two options: cache the results (but it should reload them after some timeout) or do the deref by itself (doing individual search for each member that is also long). SSSD is proposing the two options.

This creates a specific issue for diskless client. On reboot clients (sssd) have no cache and issue an expensive deref (or do the deref itself).

This RFE is to enhance deref plugin with a caching mechanism.
For each membership attribute, the plugin keep a table (hash) with the key based on the normalized DN. For a given DN, we have the list of a set of cached attributes (to be added in the config).
On post add/del/mod/modrdn, if a cached attribute is changed it updates the values in the table.
On deref search, for each member it lookup in the table to retrieve the attribute value. If the attribute is not cache it does the internal search.

The priming of the caches should make the cache available only after some time. In the interval, the it is doing current internal searches

Package Version and Platform

All version

Steps to reproduce

TBD

Actual results

deref a large group triggers many internal searches

Expected results

deref a large group should do internal search for very few entries (in theory none).


I think this is large enough that I'd like to see a design document, especially around how you plan to manage the cache lifecycle to prevent stale data or entries persisting. I'm also concerned that perhaps we are adding a cache where the real problem is hiding elsewhere ....

Metadata Update from @firstyear:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Custom field type adjusted to None
- Custom field version adjusted to None

a year ago

Metadata Update from @tbordaz:
- Issue assigned to tbordaz

a year ago

@firstyear thanks for having looked at it. I fully agree that if this ticket is accepted it will require a design. I just mentioned the overview of a possible improvement and it is not enough to detect blockers/drawbacks.
My initial look and computation showed that #internal search was a important factor. For example if we need 1000search for a deref (IMHO it is optimistic) a single deref could put a server on the knees for 1s and we can imagine several deref in parallel (several server starting in parallel). Also if a member is lookup N times (in a single deref or multiple deref), it costs N although the attribute retrieved at the first lookup is likely unchanged.
But I agree with you I will try to gather some more metric to confirm deref is a main contributor.

@tbordaz Great! I look forward to reading your findings and discussing it further what we can do.

- Test environment is composed on 2 ipa instances and 5 clients
  The IPA instances are provisioned with https://github.com/freeipa/freeipa-tools/blob/master/create-test-data.py
        users=5000,
        groups=100,
        groups_per_user=20,
        nested_groups_max_level=2,
        nested_groups_per_user=5,
        hosts=15000,
        hostgroups=500,
        hostgroups_per_host=10,
        nested_hostgroups_max_level=2,
        nested_hostgroups_per_host=5,
        direct_sudorules=20,  # users, hosts
        indirect_sudorules=80,  # groups, hostgroups
        sudorules_per_user=5,
        sudorules_per_group=2,
        sudorules_per_host=5,
        sudorules_per_hostgroup=5,
        direct_hbac=10,  # users, hosts
        indirect_hbac=100,  # groups, hostgroups
        hbac_per_user=5,
        hbac_per_group=2,
        hbac_per_host=5,
        hbac_per_hostgroup=5

- The original testcase is N*5 clients launch the command 'id <user>'
  Most of the time the clients are timing out and it is difficult to isolate the specific load
  So I isolated the new testcase that is trying to mimic the LDAP requests send by SSSD to do a single 'id <user>'
  The main requests are:
    - for the <user>, SRCH + deref of the 'memberof' groups that user is memberof
    - for each group that the <user> is memberof, SRCH + deref of the 'member' of the group

- With the provisioned database, <user> is member of 20 usergroups. In average each usergroup has 800 direct members.
  It triggers 20*800 internal SRCH (scope=0) and the command last 4.5s (~3500 SRCH/s).
  Note that the direct SRCH are done while bound as 'cn=directory manager' so there is no ACI impact on returned attributes.

- Several SRCH retrieved the same entry.
  If <user> belongs to the N usergroups and each of them has <other_user> as member, then <other_user> is SRCH N times.
  It recall the problematic of memberof updates where same intermediates nodes were looked up several times.

In conclusion:
according to the above tests, the number of internal searches is high. They contribute to the high etime.
IMHO it is likely they are the main contributer of high etime.

I am not sure I like the idea of another cache, here are some thoughts for other approaches, where in my problem the core problem is fast access to the (dereferenced) entries.

  • if using a cache I bring back of having a group/member cache which could also be used for memberof or referential integrity or ...

  • another approach is to shortcut the internal searches, I always wanted to have a more direct access to the databse than the overhead of internal searches with pblocks, plugin calls,...... Can't we have
    slapi_get_entry(dn, &entry) and just retrieve the entry and use the entyr cahce for caching

Attached test scenario to mimic 'id <user>'

sssd_load.sh

@lkrispen, when opening the ticket I remembered the discussion of having a common group/member cache that could be used by several plugins. In the specific case of deref, the goup/member cache may be not enough as deref returns attributes values. So basically the cache should also contain some (all?) attributes and at the end of the day it is like the entry cache.

I like the idea of having direct access to the entry cache instead of deref own plugin cache. It should offer a mechanism to keep an entry in the cache so that it does not get freed under us. Two concerns using entry cache are: entry memory footprint is larger than strictly required by deref, the status of entry cache is fragile as it may become obsolete with new DB page format.

@lkrispen, when opening the ticket I remembered the discussion of having a common group/member cache that could be used by several plugins. In the specific case of deref, the goup/member cache may be not enough as deref returns attributes values. So basically the cache should also contain some (all?) attributes and at the end of the day it is like the entry cache.

I would not store copies of entries, just references to the entryid and have a fast access to the entry by its id, it can then be in the entry cache or loaded on demand. If you have a separate cache for deref you have the problem to keep it consistent, the deref plugin would now have to intercet all mod/add/del/modrdn oerations to catch changes affetcting the cache, this would add overhead to other operations.

I like the idea of having direct access to the entry cache instead of deref own plugin cache. It should offer a mechanism to keep an entry in the cache so that it does not get freed under us. Two concerns using entry cache are: entry memory footprint is larger than strictly required by deref, the status of entry cache is fragile as it may become obsolete with new DB page format.

1) I think the deref cach would be added and for loading and modifying it would need to read them into the entry cache first, so I think the footprint would increase
2) in my opinion there will always be a layer to find an entyr by dn or id, the format of the entry representation may change, but some kind of ec will be there - an we are still away from this

Regarding the overhead of deref cache consistency, I was thinking to something quite simple.
add callbacks to post_txnbe add/mod/del/modrdn. remove the target entry from cache(deref cache) without evaluating the OP. So if we later need the value of that entry, we do an internal search and refill the cache.

Regarding the use of the entry cache itself, as I said if we agree to use it I like the idea.

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.0

a year ago

You pointed out the issue already:

- for each group that the <user> is memberof, SRCH + deref of the 'member' of the group

On the surface this ticket seems simple and reasonable. Make deref faster. However, in isolation this is not exposing the reality of the issue. In real world server deployments of FreeIPA they have an "all users" group, and many other places such as my former workplace contained groups of all staff and students. As a result, when I would login via sssd at UofA, it would take me 10 minutes for SSSD to complete the single level of enumeration of every single staff and student, past and present.

This is not a flaw of deref being slow, this is because the behaviour and expectation of the client is not in sync with the offered behaviour of directory server. It is not reasonable to expect queries where you deref 1 group mapping to 100,000 users to really be fast both in terms of search, but also paging of results, client processing time and more. followed by dereferencing all 100,000 users in their own queries to find their groups. This would merely return 100,000 copies of the "all users group", and effectively every group that exists on the installation that has a member.

This has turned 1 login of 1 user into 200,000 entries being sent and in excess of 100,000 queries being performed. If you have n users in y groups, then you now have y*n lookups.

I have been discussing this issue for nearly 5 years now, and I'm really sad to see it being pushed onto us as "our problem". We should be working with the SSSD team to come up with better solutions rather than bandaids in our server to mask bad behaviour.

The truth is no amount of cache, improvement or anything can solve the fact that they are attempting to search/map (almost) the entire directory. There must be a better way. At the very least, if the process is only:

- for the <user>, SRCH + deref of the 'memberof' groups that user is memberof

Without the second deref for all group members, that would prevent part of the issue. iirc that's what "ignore_group_members" does.

  • why not enable ignore_group_members by default?
  • why not have it so that when another user logs in you can then amend the local group cache with their details as well if they are in the same group if ignore_group_members is true?
  • why not flag group lookups - vs user lookups differently so that sssd can choose when is safe to deref groups back to their users?
  • why not remove the "all users" group from freeipa all together to remove the problem (or at least mask it from being deref) so that the entire directory isn't walked?
  • why make sssd the repository of all knowledge when clients would be better bypassing it and getting limited scopes of necessary data from DS directly? (py-libipa we a dsldapobjects interface anyone?)

I think these are all better solutions than us trying to fix an unfixable problem.

Conclusion: yes we can make deref a "bit faster" as the expense of huge complexity and time, but a better solution is for us to help communicate and work with sssd to improve their client behaviour to match the expectations and data models of LDAP.

As a final point - on further consideration there is no reason not to use ignore_group_members by default. IPA uses groups as role based access control which means we only care about the forward though "Does user X have role Y?". We do not need to understand "Does role Y have user X?".

As a result, the deref of users -> group is reason able, and turning on ignore_group_members is safe as we still get "does user X have role Y". There is no situation where we require the reverse knowledge for a reasonable implementation on a unix system because we always have this data, and no situation where the reverse is needed from a uid/gid standpoint.

@firstyear, SSSD will benefit from improvement of deref plugin but any other application will also benefit. The use of internal searches to fulfill a deref control is a choice that does not scale. SSSD with ignore_group_members=no will benefit of it but it will not be the only one.

I agree that ignore_group_members=No is a proposed tuning of several SSSD deployment when it creates perf issue. It was identified several years ago and I do not know why it is not the default. @jhrozek , should we open a SSSD ticket for this ?

(I have no knowledge about the DS internals, I will just comment about the client behaviour)

The typical use-case that triggers the expensive look ups is running "id user". Just as a reminder this is how its output looks like:

$ id u1
uid=1555600019(u1) gid=1555600019(u1) groups=1555600019(u1),1555600016(gr1),1555600018(gr3),1555600017(gr2)

What does this call do? From the user's perspective, it fetches the groups the user is a member of.

From the system's perspective, id does several calls to the Name Service Switch (NSS) subsystem which then calls to SSSD which does the LDAP calls. And unfortunately, especially for group-related calls, there is no way to know what is being asked. Let me explain in more detail.

Running "id u1" will translate into several NSS calls:
- getpwnam(u1) - this returns "struct passwd" about u1. From SSSD's perspective, this is quite fast, it's one search
- initgroups(u1) - this returns the numerical GIDs the user is a member of. From SSSD's perspective, this is one deref to get the group objects and then the client reads their gidNumber attributes and returns them. Note that the important thing is that now you have the numerical GIDs but the "id" command wants to return group names. And because the only API Linux has for returning group information based on GID is getgrgid, what id then does is:
- for gid in initgroups_result { getgrgid(gid) }

This is what's super expensive. Check out the man page of getgrgid ,the result is struct group which /also contains the members/. So the id command wants to just translate the numerical GID to a string name, but in addition to the group name, it also gets the group members. And potentially any nested group members.

Unfortunately, libc doesn't allow any group_name_by_gid command nor does it let you specify what kind of information you want to get with flags. And because SSSD has no idea who is running the getgrgid() command, it just really returns all the information.

As @firstyear says, this is a pain point for many years for SSSD, but because SSSD doesn't know if the getgr* request comes from someone doing "ls -l" on a file owned by group "all_students" where you don't need the members of the groups resolved or of the admin really ran "getent group all_students" and wants to enumerate the group members, it always returns all the information.

There were some ideas floating around:
- make ignore_group_members=True the default. I honestly think it's a good idea, because except for running "getent group" on the command line, there is no reason you typically care about the group members..except..some people are stupid and have apps that do permission checks on getgr* output instead of initgroups output. There were also support calls from admins who flipped ignore_group_members to True and then came back complaining that "all the groups show empty now". So we never did this change because it breaks expectations that admins have
- add a group_name_by_gid call to libc. Convert apps that only care about group name to use group_name_by_gid. This would be IMO the most systematic choice, but it's such an uphill battle that we never even started
- (warning, super hacky idea, please don't read unless you sit comfortably). SSSD knows the PID of the peer who is asking the info. Read /proc/$PID/exec, if the peer is "ls", "id" or some other app that we know doesn't care about group members, make an internal request that doesn't read the members.

I hope this comment explains why does SSSD do the seemingly stupid thing of downloading the whole group when all the app cares about is gid-to-name translation.

I'm really open to ideas, but as long SSSD lookups are triggered by the NSS interface, I'm afraid our hands are tied to a large extent.

  • make ignore_group_members=True the default. I honestly think it's a good idea, because except for running "getent group" on the command line, there is no reason you typically care about the group members..except..some people are stupid and have apps that do permission checks on getgr* output instead of initgroups output. There were also support calls from admins who flipped ignore_group_members to True and then came back complaining that "all the groups show empty now". So we never did this change because it breaks expectations that admins have

I do not believe that we should do huge amounts of work in order to support admins who are too lazy to change bad behaviour. If you think it's a good idea, then we should do it. It's the path of least resistance, solves a huge known performance issue, and means we have a simple avenue to pursue to resolution of this matter. Groups should be empty, and it's that "users are members" not "groups have members" that we should be considering.

  • add a group_name_by_gid call to libc. Convert apps that only care about group name to use group_name_by_gid. This would be IMO the most systematic choice, but it's such an uphill battle that we never even started

I think today it's not worth it. Linux machines are "single application" affairs, and even containers less and less care about this data. I think that we should just not bother. Again, remember, we generally don't care about "In this group, is there a member named X?" as much as "X has group list of A, B, C".

  • (warning, super hacky idea, please don't read unless you sit comfortably). SSSD knows the PID of the peer who is asking the info. Read /proc/$PID/exec, if the peer is "ls", "id" or some other app that we know doesn't care about group members, make an internal request that doesn't read the members.

Owwwwwww this hurts me :) Please don't do this.

I hope this comment explains why does SSSD do the seemingly stupid thing of downloading the whole group when all the app cares about is gid-to-name translation.
I'm really open to ideas, but as long SSSD lookups are triggered by the NSS interface, I'm afraid our hands are tied to a large extent.

Our hands are tied to an extent, but ignore_group_members=True exists today and resolves the issue. A simple way to extend this is that when SSSD does the id(u1) lookup it templates out the groups say "g1, g2, ...". but DONT retrieve the members of the group, just leave them as only have u1 as their only member. Now only when you do the gid(g1) request, retrieve the group members, but don't deref the users - only list users who have had the id(X) call applied to them.

Then when u2 logs in, if they are also a member of g1, sssd can amend locally it's knowledge of g1 to also have u2 - similar if u2 logs in and g1 was taken away, this can also be amended.

This could be an interesting middle ground to the problem as you then get groups listing their "partial" membersets, and you still avoid huge derefs.

EDIT: Forgot to say: I don't think we should continue to be bound by legacy ideas in this space, because today we need fresh, challenging ideas and a willingness to make changes and cause some pain for long term success. If we continue to try and support every ancient unix ideal about groups/users, we'll never be able to adapt to modern idm designs and principles. If people want to stick to that behaviour, they should do something else. Part of the benefit of freeipa is that is is opinionated and says "here is how it should be". We need to use those opinions here I think :)

I hope this helps,

@firstyear, SSSD will benefit from improvement of deref plugin but any other application will also benefit. The use of internal searches to fulfill a deref control is a choice that does not scale. SSSD with ignore_group_members=no will benefit of it but it will not be the only one.

I tend to agree with firstyear, that deref is already an optimization for clients which otherwise would have to do tons of external seearches to get the same information, the client approach does not scale, and we could only try to improve deref handling.
But th eproposed cache solution also has more problems to solve. for deref we need to evaluate aci if the client looking up the group is allowed to see the deref'ed attributes, so the cach would have to handle that as well

@lkrispen I agree that it will improve, but it's impossible to solve the issue that triggered the creation of the ticket. Improvement of deref is not the solution or the problem, the problem is the exponential mapping behaviour of SSSD that is triggering derefs. As a result, it's easy to blame the 389 server, and to make tickets in isolation like this that are ignoring the true causes of the issue.

So while improving deref would be nice, any work we do will not improve the situation meaningfully.

For example, even if made it so that deref searches went from say ... 0.5 second to 0.25 second. And lets assume it takes 0.01 to send an entry to the client. 50% improvement (which is nearly impossible to achieve in reality ...). SSSD is still going to lookup my user and deref that to groups (1 deref to say ... 20 groups), and then it will deref all the members of those groups to their users - in a large env lets call this 100,000 users. So now we have:

Before:

 (0.5 +  (21 * 0.01)) + ((20 * 0.5) + (100,000 * 0.01)) = 1010.71 seconds

After

(0.25 +  (21.0 * 0.01)) + ((20.0 * 0.25) + (100000.0 * 0.01)) = 1005.46 seconds

So while deref is faster, we still have a cost to send entries (or any other overhead). Yes these are rough numbers, and we could make them more accurate. But this is highlighting that the issue is not deref, but the expectation that dereferencing small numbers of groups to thousands of user accounts (often the entire directories user account set) is not feasible.

We may as well just invest in making the search server faster, or entry transmission faster, or connection handling faster - all of them will contribute to making this faster. But no matter how much work we do, it will not fix the behaviour of SSSD, which is doing something that is not possible to be made "efficient" at scale. We should not be adding "complex solutions" into our server just to handle bad client behaviour, when the complexity won't have a large impact on the issue.

As a result, I think it would be better for everyones time to examine solutions within SSSD - as @jhrozek has already mentioned, ignore_group_members = True by default for example.

I have a mixed feeling regarding the last updates.

First thing, I agree that it is not good to make DS more complex to satisfy not rational client needs. In the sssd example, @jhrozek agreed it makes no real sense to fetch group members and ignore_group_members=True is a known workaround for years now. So fixing SSSD (ignore_group_member=True being the default) looks the right approach.

Now, deref is a supported control and IMHO it is a nice one, it is faster for DS to process 1000 internal lookups than to receive 1000 SRCH. The way it is implemented is not optimal. Although the gain computed by @firstyear looks minor, the benefit seen with memberof improvement (replace internal searches with cache) made the memberof plugin up to 20 times faster (https://pagure.io/389-ds-base/issue/49031). So to be sure of potential gain I would prefer a prototype. Internal searches are a known performance killer and in case of deref it is the killer.

This ticket is to evaluate how to make deref faster and there are ideas on the table to improve it.

Should it be a priority (considering complexity and risk) is an other aspect and I agree with your arguments it is no longer a priority, at the condition SSSD stop using deref to fetch group members.

In my opinion we should not develop a specific solution for the deref plugin, the problem are the internal searches and if we can improve them the deref plugin will automatically benefit.

I will repeat my proposal of an new api function like: slapi_get_entry(backend, dn) (params to be decided, eg if pblock is needed)
This will not depend on a specific backend implementation or on the entry cache, it will lookup the id for a dn and read it from id2entry, if ther is an entry cache layer: fine, if not it still will avoid teh overhead of an internal search

I noticed that my last comment was a bit unclear. When I said "improve internal searches" I did not mean slapi_search_internal() but the more general task of internally finding an entry, and it could be done by slapi_search_internal or a new function like slapi_get_entry()

@lkrispen as said I like the idea of having a direct access to an entry (slapi_get_entry) via entry cache (+-db access).

I think deref could benefit of it (+aci_access_allowed) as a replacement of internal search, rather that implementing its own cache.
IMHO although deref improvement is not a priority, it worthes opening a new RFE to implement slapi_get_entry).

There is certainly some overheads in the design of internal searches which seems to be legacy from a model of "untrusted plugins" and attempting to authenticate these. Saying this, improvements to search and internal searches I think is the right approach here as that will benefit the entire server, not just deref. So I think this comes back to "make the search faster".

So as an "action plan" do we think that @jhrozek should take back to SSSD/IPA that ignore_group_members=True is a default, and that we should improve internal search "broadly"?

As part of this, I have some ideas on profiling to share with you @tbordaz about how we can assess and continue to improve our search performance. I think we really need to improve our profiling and introspection first so we can make informed decisions. How does that sound for an approach?

So as an "action plan" do we think that @jhrozek should take back to SSSD/IPA that ignore_group_members=True is a default, and that we should improve internal search "broadly"?

I did bring this up on both sssd-devel and our internal team list and all I hear is crickets.

@jhrozek, my understanding is that the only reason we want to keep ignore_group_members=False is that 'getent group' is supposed to return informations about the members. Do you think getent could use an environment variable that SSSD would consum to decide 'deref or not deref'

@firstyear, working of profiling is something I have been hoping for years !
I was thinking of defining stap probes at specific places + specific stap scripts to display them.
It would worth a separated ticket

@jhrozek, my understanding is that the only reason we want to keep ignore_group_members=False is that 'getent group' is supposed to return informations about the members. Do you think getent could use an environment variable that SSSD would consum to decide 'deref or not deref'

This is maybe 50% of the use-cases. The other 50% is (arguably broken) applications that use the output of getgrnam or similar for access control. The variable would have to be set in the context of whatever is running getpwnam, so typically in the shell. Then we'd have to modify nss_sss to honour this variable value and send a different kind of request. So, doable, but not trivial.

I could swear that polkit was one of the offenders at least back in the day, but at the same time I can't find the bugzilla, so maybe (hopefully!) I'm wrong.

I did bring this up on both sssd-devel and our internal team list and all I hear is crickets.

This is generally the reaction, yes :) but it's not a "no" ...

working of profiling is something I have been hoping for years !

Great! I think we should open that ticket and define what we want to be able to achieve and retrieve from profiling, as well as the use cases where we might run it.

@lkrispen as said I like the idea of having a direct access to an entry (slapi_get_entry) via entry cache (+-db access).
I think deref could benefit of it (+aci_access_allowed) as a replacement of internal search, rather that implementing its own cache.
IMHO although deref improvement is not a priority, it worthes opening a new RFE to implement slapi_get_entry).

Opened https://pagure.io/389-ds-base/issue/50030: RFE create a new interface slapi_get_entry to get a Slapi_entry from a DN

@tbordaz Sounds good. If you have time can you check my (now old) slapi_v4 patch to see if I added similar there? I know I played with the slapi_ entry interfaces a bit in that work.

@tbordaz Sounds good. If you have time can you check my (now old) slapi_v4 patch to see if I added similar there? I know I played with the slapi_ entry interfaces a bit in that work.

good point, I was also thinking if this new functionality could be part of a new api version.

But I lost the state of your slapi_v4 work, could you add som elinks to docs and patches ? thanks.
And mybe better do it in #50030

I will try and update this soon when I get some time :)

Login to comment on this ticket.

Metadata
Attachments 1
Attached a year ago View Comment