#3118 using overrides causes segfault in libldb
Closed: Fixed None Opened 7 years ago by rrigby.

testing sssd with overrides (users and groups) on centos 7.2 (sssd 1.13.0, libldb 1.1.25), i soon ran into problems with sssd_nss crashing. dmesg shows:

sssd_nss[28935]: segfault at 51 ip 00007fa5e39d46af sp 00007ffcd6f18290 error 4 in libldb.so.1.1.25[7fa5e39c4000+2d000]

backtrace:

  Program received signal SIGSEGV, Segmentation fault.
  0x00007f4b514276af in ldb_dn_from_ldb_val (mem_ctx=mem_ctx@entry=0x7f4b52297300, ldb=0x7f4b5229dad0, strdn=0x25) at ../common/ldb_dn.c:97
  97              if (strdn && strdn->data
  (gdb) bt
  #0  0x00007f4b514276af in ldb_dn_from_ldb_val (mem_ctx=mem_ctx@entry=0x7f4b52297300, ldb=0x7f4b5229dad0, strdn=0x25) at ../common/ldb_dn.c:97
  #1  0x00007f4b5188109a in sysdb_add_group_member_overrides (domain=domain@entry=0x7f4b52296120, obj=0x7f4b522a3e40) at src/db/sysdb_views.c:1308
  #2  0x00007f4b5187373c in sysdb_getgrgid_with_views (mem_ctx=mem_ctx@entry=0x7f4b52295ea0, domain=domain@entry=0x7f4b52296120, gid=65751, res=res@entry=0x7f4b522a3260) at src/db/sysdb_search.c:659
  #3  0x00007f4b51ef292c in nss_cmd_getgrgid_search (dctx=dctx@entry=0x7f4b522a3240) at src/responder/nss/nsssrv_cmd.c:3349
  #4  0x00007f4b51ef672d in nss_cmd_getbyid (cmd=<optimized out>, cctx=0x7f4b522a0900) at src/responder/nss/nsssrv_cmd.c:1975
  #5  0x00007f4b51f01b2e in client_cmd_execute (sss_cmds=0x7f4b521182e0 <nss_cmds>, cctx=0x7f4b522a0900) at src/responder/common/responder_common.c:249
  #6  client_recv (cctx=0x7f4b522a0900) at src/responder/common/responder_common.c:283
  #7  client_fd_handler (ev=<optimized out>, fde=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/common/responder_common.c:335
  #8  0x00007f4b4e15bd0b in epoll_event_loop_once () from /lib64/libtevent.so.0
  #9  0x00007f4b4e15a1d7 in std_event_loop_once () from /lib64/libtevent.so.0
  #10 0x00007f4b4e15636d in _tevent_loop_once () from /lib64/libtevent.so.0
  #11 0x00007f4b4e15650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
  #12 0x00007f4b4e15a177 in std_event_loop_wait () from /lib64/libtevent.so.0
  #13 0x00007f4b51891553 in server_loop (main_ctx=0x7f4b5228d2a0) at src/util/server.c:668
  #14 0x00007f4b51eeaf77 in main (argc=8, argv=<optimized out>) at src/responder/nss/nsssrv.c:626

running sssd with -d 9, and having added ldb tracing to src/db/sysdb.c (line 59 in the unpatched source):

ret = ldb_connect(ldb, filename, LDB_FLG_ENABLE_TRACING, NULL);

things die just after retrieving the override information for a member of a group, e.g. (username/domain name removed):

  [sssd[nss]] [ldb] (0x4000): ldb_trace_response: ENTRY
  dn: overrideAnchorUUID=:LOCAL:name\3Dusername\,cn\3Dusers\,cn\3Ddomainname\,cn\3Dsysdb,cn=LOCAL,cn=views,cn=sysdb
  loginShell: /bin/tcsh
  name: username
  objectClass: userOverride
  overrideObjectDN: name=username,cn=users,cn=domainname,cn=sysdb
  uidNumber: 6272

  [sssd[nss]] [ldb] (0x4000): Destroying timer event 0x7f4b522a3cc0 "ltdb_timeout"
  [sssd[nss]] [ldb] (0x4000): Ending timer event 0x7f4b522ae710 "ltdb_callback"
  [sssd[nss]] [sysdb_add_group_member_overrides] (0x4000): Added [username] to [overridememberUid].

it always seems to fail on the first or second member of the group, and it is always when the group being looked at has overrides (gid).

when dropping back to the previous ldb packages for centos 7.2 (ldb 1.1.20) everything seems to work just fine, so i looked at the differences, and it seems that this change, which was added in ldb 1.1.24 might be significant:

  --- ldb-1.1.20/ldb_tdb/ldb_search.c     2014-09-16 19:04:31.000000000 +0100                    
  +++ ldb-1.1.25/ldb_tdb/ldb_search.c     2015-12-10 11:01:40.000000000 +0000                    
  @@ -407,10 +407,18 @@                                                                          
          }

          talloc_free(msg->elements);                                                            
  -       msg->elements = talloc_realloc(msg, el2, struct ldb_message_element, msg->num_elements);
  +                                                                                               
  +       if (num_elements > 0) {                                                                 
  +               msg->elements = talloc_realloc(msg, el2, struct ldb_message_element,            
  +                                              num_elements);                                   
  +       } else {                                                                                
  +               msg->elements = talloc_array(msg, struct ldb_message_element, 0);               
  +               talloc_free(el2);                                                               
  +       }                                                                                       
          if (msg->elements == NULL) {                                                            
                  return -1;                                                                      
          }

reverting this change stops things from crashing, as does just adding 1 to num_elements in the talloc_realloc call, e.g.:

  --- ldb-1.1.25/ldb_tdb/ldb_search.c     2015-12-10 11:01:40.000000000 +0000
  +++ ldb-1.1.25.test/ldb_tdb/ldb_search.c        2016-08-02 16:37:01.823488833 +0100
  @@ -410,7 +410,7 @@

          if (num_elements > 0) {
                  msg->elements = talloc_realloc(msg, el2, struct ldb_message_element,
  -                                              num_elements);
  +                                              num_elements+1);

i have had a bit of a poke around, but can't say i have been able to work out exactly why this is the case ...

i would like to have been able to give a better report of the exact cause of the problem, but have unfortunately run out of time to look at this for now.

at the moment, i can stick with ldb-1.1.20, but that's not really a long term solution. i did also do some quick testing with sssd-1.14.0, and the problem remains.

let me know if i can provide any more information.

thanks,

richard



I doubt that the bug is in libldb.
There is a much higher chance that the bug is directly in sssd.

Could you try to reproduce with the latest 1.13 [1]? (1.14 might have some bugs.) I would say there is a high chance that the bug is fixed in 1.13. If it is not fixed, could you provide steps to reproduce the crash?

[1] https://copr.fedorainfracloud.org/coprs/g/sssd/sssd-1-13/

cc: => lslebodn

thanks for getting back to me. i'm afraid i haven't had too much time for further testing.

unfortunately, the problem still exists in current 1.13.

to reproduce, this seems to work for me ... override a group:

   sss_override group-add group1@domain -g 5001

override a few users who are members of that group:

   sss_override user-add user1@domain -u 6001 -s /bin/tcsh
   sss_override user-add user2@domain -u 6002 -s /bin/tcsh
   sss_override user-add user3@domain -u 6003 -s /bin/tcsh

i'm not sure that it really matters which attributes are overwritten, this is just what i was testing when things started to fail.

then, to trigger the problem, something like:

  systemctl stop sssd
  \rm /var/lib/sss/mc/*
  systemctl start sssd
  groups user1
  systemctl stop sssd
  \rm /var/lib/sss/mc/*
  systemctl start sssd
  groups user2
  systemctl stop sssd
  \rm /var/lib/sss/mc/*
  systemctl start sssd
  groups user3

repeat as necessary. for me the problems seem to start once there are at least 2 memberuid entries for the group in the cache ldb file.

where the segmentation fault occurs (around sysdb_views.c:1307 in unpatched 1.13.4 source):

    for (c = 0; c < members->num_values; c++) {
        member_dn = ldb_dn_from_ldb_val(tmp_ctx, domain->sysdb->ldb,
                                        &members->values[c]);

it seems to fail after a couple of iterations on the overridden group (e.g. c=2), at which point members has been overwritten / wiped out somewhere along the way.

that's about as far as i have managed to get just now ... if i have a bit of time over the weekend, i'll see if i can give it another look.

hope that's of some use, but let me know if you need any more details from me.

thanks,

richard

If you have time, it would be good to try valgrind.

Add the following line to the nss section in sssd.conf:

command = valgrind -v --log-file=/var/log/sssd/valgrind_nss_%p.log /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files

Restart sssd (you might need to have SELinux in permissive mode), reproduce the crash, and then provide the valgrind log file.

thanks for the suggestion - valgrind log attached, and i think i can now see what is going on.

members hangs off obj.

during the members loop mentioned previously, there is a call to ldb_msg_add_string (line 1445 in unpatched 1.13.4 source):

  ret = ldb_msg_add_string(obj, OVERRIDE_PREFIX SYSDB_MEMBERUID, val);

which eventually reaches _ldb_msg_add_el in ldb_msg.c, where this happens:

  els = talloc_realloc(msg, msg->elements,
                       struct ldb_message_element, msg->num_elements + 1);

this reallocates the elements array of obj, and invalidates members, which points into that array.
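
The failure mode described above can be sketched in plain C. This is a simplified illustration, not the SSSD or ldb code: `struct message`, `struct element`, `msg_find`, `msg_add_string` and `add_member_overrides` are hypothetical stand-ins, and `realloc` is used in place of `talloc_realloc`. It shows one way to stay safe while appending inside the loop: keep the index of the member element and look it up through `msg->elements` on every pass, instead of caching a pointer that an append may invalidate. It is not necessarily how the actual patch addresses the problem.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for ldb_message_element / ldb_message. */
struct element {
    const char *name;
    const char **values;   /* borrowed string pointers */
    size_t num_values;
};

struct message {
    struct element *elements;
    size_t num_elements;
};

/* Return the index of the element called name, or -1 if it is absent. */
long msg_find(const struct message *msg, const char *name)
{
    for (size_t i = 0; i < msg->num_elements; i++) {
        if (strcmp(msg->elements[i].name, name) == 0) {
            return (long)i;
        }
    }
    return -1;
}

/* Append val to element name, creating the element if needed.
 * Creating it grows msg->elements with realloc, which may MOVE the array:
 * any struct element pointer obtained earlier is then dangling. */
int msg_add_string(struct message *msg, const char *name, const char *val)
{
    long idx = msg_find(msg, name);
    if (idx < 0) {
        struct element *els = realloc(msg->elements,
                                      (msg->num_elements + 1) * sizeof(*els));
        if (els == NULL) {
            return -1;
        }
        msg->elements = els;
        idx = (long)msg->num_elements++;
        els[idx] = (struct element){ .name = name };
    }

    struct element *el = &msg->elements[idx];
    const char **vals = realloc(el->values,
                                (el->num_values + 1) * sizeof(*vals));
    if (vals == NULL) {
        return -1;
    }
    vals[el->num_values++] = val;
    el->values = vals;
    return 0;
}

/* Copy every "memberuid" value into "overridememberuid".
 * The member element is addressed by index on each iteration, so a move of
 * msg->elements inside msg_add_string cannot leave a stale pointer behind. */
int add_member_overrides(struct message *msg)
{
    assert(msg != NULL);

    long idx = msg_find(msg, "memberuid");
    if (idx < 0) {
        return 0;   /* no members, nothing to do */
    }
    for (size_t c = 0; c < msg->elements[idx].num_values; c++) {
        const char *val = msg->elements[idx].values[c];
        if (msg_add_string(msg, "overridememberuid", val) != 0) {
            return -1;
        }
    }
    return 0;
}
```

The index stays valid here only because elements are appended and never removed or reordered during the loop. The buggy shape would be to take `struct element *members = &msg->elements[idx];` once before the loop and keep dereferencing it after the first append.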

hope that's of some use, and thanks again for your help.

richard

BTW, I tested the patch and it works for me.
Thank you very much for your time and your analysis.
I really appreciate it.

I am changing the title a little bit, because the same crash can happen with libldb-1.1.20. It worked for you only by chance with that version. It depends on glibc/kernel whether it will move the memory or just extend the current block.
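
As a small illustration of that last point, the sketch below (plain `realloc`, not talloc; `grow_buffer` is a hypothetical helper) reports whether a buffer was moved or extended in place. The C standard allows either outcome, so code that keeps using the old address only works when the allocator happens to extend the block.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Grow *buf to new_size bytes.
 * Returns 1 if the block was moved, 0 if it stayed at the same address,
 * and -1 if realloc failed (in which case *buf is left untouched). */
int grow_buffer(char **buf, size_t new_size)
{
    assert(buf != NULL);

    /* Record the old address as an integer before calling realloc, because
     * the old pointer value must not be used once the block has moved. */
    uintptr_t before = (uintptr_t)*buf;

    char *grown = realloc(*buf, new_size);
    if (grown == NULL) {
        return -1;
    }
    *buf = grown;
    return (uintptr_t)grown != before;
}
```

The return value for a given size can differ between allocator versions and between runs, which is consistent with the same bug crashing with one library build and staying hidden with another.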

summary: using overides causes segfault in libldb > 1.1.23 => using overides causes segfault in libldb

Fields changed

owner: somebody => lslebodn
status: new => assigned

i applied the patch against 1.13.4, and ran the same tests which had reliably been causing the crashes, put it through a debugger, etc., added 'loads' of user and group overrides, set it off looking up information for 2000 or so users, and all seems to be working just fine - no issues or signs of any trouble at all.

so, looks good to me.

glad to have been of some use in tracking down the problem, and thanks for your help in working things out, and getting it fixed.

richard

Actually,
the crash is indirectly fixed in git master by commit 1594701

Should we still commit the patch to the stable branch, though?

glad to see things are already fixed in the repository.

for me, the fix is really required in the el7 packages ... shall i report the issue in the red hat bugzilla?

thanks,

richard

The bug will be fixed in RHEL-7.3, so I guess you should be good :) (Testing of the 7.3 packages would be most welcome in the meantime!)

that's great. thanks.

i'm quite happy to test the packages when available. i'm guessing that will be when 7.3 beta is released, unless i should also be looking somewhere else for the updated packages?

thanks again.

richard

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.13.5

I sent a backported version of the groupmembers override patch together with 2 fixes by Lukas to the list.

patch: 0 => 1

sssd-1-13:

resolution: => fixed
status: assigned => closed

Metadata Update from @rrigby:
- Issue assigned to lslebodn
- Issue set to the milestone: SSSD 1.13.5

7 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/4151

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.
