#1226 sssd_be crashes on shutdown
Closed: Fixed. Opened 8 years ago by sgallagh.

We have a memory hierarchy bug that is visible in the IPA provider and may also be present, though dormant, in all of the other providers.

When we receive a SIGTERM, we invoke our graceful exit handler, which starts calling talloc_free() on the top-level contexts, including the be_ctx.

The be_ctx has two branches of children: the provider-specific data and any sbus_connections that may be currently active.

The crash occurs when talloc frees the provider-specific data before the sbus_connections. In some circumstances, one or more destructors within the sbus_connection memory branch then attempt to access the provider-specific data that has already been freed. When this happens, talloc calls abort() due to the use-after-free (older talloc versions erroneously reported this as a double free, as seen in the backtrace below).

The proposed approach is to allocate the be_req atop the provider-specific data instead of directly on the sbus_connection. We will then add a talloc_spy to the sbus_connection so that if it is freed (for example, if the connection is dropped), it will explicitly call talloc_free() on the pending be_req.

In this way, whichever path is followed first at shutdown, the provider-specific data is guaranteed to remain available until all pending requests have been safely cancelled.
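
To illustrate the ownership change proposed above, here is a minimal sketch in C. It deliberately does not use talloc: struct ctx, ctx_new(), ctx_free(), struct req_state and shutdown_demo() are all hypothetical names for a toy context tree that, like talloc, runs a node's destructor before freeing its children. Having the request's destructor detach the spy is an assumption of this sketch, and the actual patch may differ in detail.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy context tree standing in for talloc (hypothetical, not the talloc API). */
struct ctx {
    struct ctx *parent, *child, *next;
    void (*destructor)(struct ctx *self); /* runs before the children are freed */
    void *data;                           /* payload owned by the caller */
    int *released;                        /* optional: set to 1 once this node is released */
};

static struct ctx *ctx_new(struct ctx *parent)
{
    struct ctx *c = calloc(1, sizeof(*c));
    assert(c != NULL);
    c->parent = parent;
    if (parent != NULL) {                 /* link at the head of the parent's child list */
        c->next = parent->child;
        parent->child = c;
    }
    return c;
}

static void ctx_free(struct ctx *c)
{
    if (c->destructor != NULL) {
        void (*d)(struct ctx *) = c->destructor;
        c->destructor = NULL;             /* never run a destructor twice */
        d(c);
    }
    while (c->child != NULL) {
        ctx_free(c->child);               /* each child unlinks itself below */
    }
    if (c->parent != NULL) {              /* unlink from the parent's child list */
        struct ctx **pp = &c->parent->child;
        while (*pp != c) {
            pp = &(*pp)->next;
        }
        *pp = c->next;
    }
    if (c->released != NULL) {
        *c->released = 1;
    }
    free(c);
}

/* Hypothetical bookkeeping for one pending request (the be_req in the ticket). */
struct req_state {
    int *provider_released;  /* flag that flips once the provider data is gone */
    struct ctx *spy;         /* spy hanging off the connection, or NULL */
    int cancelled_ok;        /* 1 if cancelled while the provider data was valid */
};

static void req_destructor(struct ctx *req)
{
    struct req_state *st = req->data;
    st->cancelled_ok = !*st->provider_released;
    if (st->spy != NULL) {
        st->spy->data = NULL;             /* detach: the spy must not free us again */
        st->spy = NULL;
    }
}

static void spy_destructor(struct ctx *spy)
{
    struct ctx *req = spy->data;
    if (req != NULL) {
        ctx_free(req);                    /* connection dropped: cancel the request */
    }
}

/* Builds be_ctx -> {provider -> req, conn -> spy} and tears it down in either order. */
int shutdown_demo(int free_conn_first)
{
    int provider_released = 0;
    struct req_state st = { &provider_released, NULL, 0 };
    struct ctx *be_ctx = ctx_new(NULL);
    struct ctx *provider = ctx_new(be_ctx);
    struct ctx *conn = ctx_new(be_ctx);
    struct ctx *req = ctx_new(provider);  /* the request lives under the provider data */
    struct ctx *spy = ctx_new(conn);      /* the connection only holds a spy */

    provider->released = &provider_released;
    req->data = &st;
    req->destructor = req_destructor;
    spy->data = req;
    spy->destructor = spy_destructor;
    st.spy = spy;

    ctx_free(free_conn_first ? conn : provider);
    ctx_free(be_ctx);
    return st.cancelled_ok;
}
```

In this toy model, shutdown_demo(1) (connection branch freed first) and shutdown_demo(0) (provider branch freed first) both return 1: the request is always cancelled while the provider-specific data is still valid, and the spy never touches a request that has already been freed.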

#0  0x00000032fa832885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
        resultvar = 0
        pid = 2067
        selftid = 2067
#1  0x00000032fa834065 in abort () at abort.c:92
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7fff5ca91b30, sa_sigaction = 0x7fff5ca91b30}, sa_mask = {__val = {16, 21, 218993089241, 140734747974448, 218993085322, 140734747974768, 218993089483, 140733193388396, 218995357968,
              11, 13782216, 37, 218995358144, 13, 218980449216, 140734747974687}}, sa_flags = -401273743, sa_restorer = 0xffff}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00000032fc401ddc in talloc_abort (reason=0x32fc407bc0 "Bad talloc magic value - double free") at talloc.c:199
No locals.
#3  0x00000032fc401f18 in talloc_abort_double_free (ptr=<value optimized out>) at talloc.c:218
No locals.
#4  talloc_chunk_from_ptr (ptr=<value optimized out>) at talloc.c:239
        pp = <value optimized out>
        tc = <value optimized out>
#5  talloc_get_name (ptr=<value optimized out>) at talloc.c:937
        tc = 0x0
#6  0x00000032fc401f5e in talloc_check_name (ptr=0xa77330, name=0x7f3bd9f78131 "struct ipa_access_ctx") at talloc.c:956
        pname = <value optimized out>
#7  0x00007f3bd9f19e5c in hbac_sysdb_save (req=0x43ff3b10) at src/providers/ipa/ipa_access.c:427
        ret = <value optimized out>
        in_transaction = false
        hbac_ctx = 0xacc8e0
        domain = 0xa57900
        sysdb = <value optimized out>
        base_dn = <value optimized out>
        be_ctx = <value optimized out>
        access_ctx = <value optimized out>
        tmp_ctx = <value optimized out>
        __FUNCTION__ = "hbac_sysdb_save"
#8  0x0000003301c0447e in tevent_req_finish (req=<value optimized out>, error=<value optimized out>, location=<value optimized out>) at tevent_req.c:133
No locals.
#9  _tevent_req_error (req=<value optimized out>, error=<value optimized out>, location=<value optimized out>) at tevent_req.c:188
No locals.
#10 0x00007f3bd9f1f4ca in ipa_hbac_rule_info_done (subreq=<value optimized out>) at src/providers/ipa/ipa_hbac_rules.c:205
        ret = 5
        req = 0x43ff3b10
        state = <value optimized out>
        __FUNCTION__ = "ipa_hbac_rule_info_done"
#11 0x0000003301c0447e in tevent_req_finish (req=<value optimized out>, error=<value optimized out>, location=<value optimized out>) at tevent_req.c:133
No locals.
#12 _tevent_req_error (req=<value optimized out>, error=<value optimized out>, location=<value optimized out>) at tevent_req.c:188
No locals.
#13 0x00007f3bd9f3521a in sdap_get_generic_done (op=<value optimized out>, reply=0x0, error=5, pvt=<value optimized out>) at src/providers/ldap/sdap_async.c:932
        req = 0xd6d4ee0
        state = 0xacc010
        attrs = <value optimized out>
        errmsg = 0x0
        result = <value optimized out>
        ret = <value optimized out>
        lret = <value optimized out>
        total_count = <value optimized out>
        cookie = {bv_len = 36738272, bv_val = 0x7f3bd9f8bb38 "src/providers/ldap/sdap_fd_events.c:57"}
        returned_controls = 0x0
        page_control = <value optimized out>
        __FUNCTION__ = "sdap_get_generic_done"
#14 0x00007f3bd9f36515 in sdap_handle_release (mem=<value optimized out>) at src/providers/ldap/sdap_async.c:117
        op = 0x2434390
#15 sdap_handle_destructor (mem=<value optimized out>) at src/providers/ldap/sdap_async.c:94
        sh = 0x23b8890
#16 0x00000032fc402d9e in _talloc_free_internal (ptr=0x23b8890, location=0x32fc407b1d "talloc.c:1893") at talloc.c:600
        d = 0x7f3bd9f364a0 <sdap_handle_destructor>
        tc = 0x7f3bd9f364a0
#17 0x00000032fc402c2b in _talloc_free_internal (ptr=0x23b5f50, location=0x32fc407b1d "talloc.c:1893") at talloc.c:631
        child = 0x23b8890
        new_parent = 0x0
        tc = 0x23b8890
#18 0x00000032fc402c2b in _talloc_free_internal (ptr=0xa75190, location=0x32fc407b1d "talloc.c:1893") at talloc.c:631
        child = 0x23b5f50
        new_parent = 0x0
        tc = 0x23b5f50
#19 0x00000032fc402c2b in _talloc_free_internal (ptr=0xa72ea0, location=0x32fc407b1d "talloc.c:1893") at talloc.c:631
        child = 0xa75190
        new_parent = 0x0
        tc = 0xa75190
#20 0x00000032fc402c2b in _talloc_free_internal (ptr=0xa6e590, location=0x32fc407b1d "talloc.c:1893") at talloc.c:631
        child = 0xa72ea0
        new_parent = 0x0
        tc = 0xa72ea0
#21 0x00000032fc402c2b in _talloc_free_internal (ptr=0xa56ed0, location=0x32fc407b1d "talloc.c:1893") at talloc.c:631
        child = 0xa6e590
        new_parent = 0x0
        tc = 0xa6e590
#22 0x00000032fc402c2b in _talloc_free_internal (ptr=0xa545f0, location=0x32fc407b1d "talloc.c:1893") at talloc.c:631
        child = 0xa56ed0
        new_parent = 0x0
        tc = 0xa56ed0
#23 0x00000032fc402c2b in _talloc_free_internal (ptr=0xa53480, location=0x32fc407b1d "talloc.c:1893") at talloc.c:631
        child = 0xa545f0
        new_parent = 0x0
        tc = 0xa545f0
#24 0x00000032fc401abb in _talloc_free_internal (ptr=0xa532a0, location=0x32fc407b1d "talloc.c:1893") at talloc.c:631
        child = 0xa53480
        new_parent = 0x0
#25 _talloc_free (ptr=0xa532a0, location=0x32fc407b1d "talloc.c:1893") at talloc.c:1133
        tc = 0xa53480
#26 0x00000032fa835d92 in __run_exit_handlers (status=0) at exit.c:78
        atfct = <value optimized out>
        onfct = <value optimized out>
        cxafct = <value optimized out>
        f = <value optimized out>
#27 exit (status=0) at exit.c:100
No locals.
#28 0x0000000000436267 in sig_term (sig=<value optimized out>) at src/util/server.c:232
        done_sigterm = 0
        __FUNCTION__ = "sig_term"
#29 0x00007f3bd9f68dfc in krb5_finalize (ev=<value optimized out>, se=<value optimized out>, signum=15, count=<value optimized out>, siginfo=<value optimized out>, private_data=<value optimized out>)
    at src/providers/krb5/krb5_common.c:652
        realm = <value optimized out>
        ret = <value optimized out>
        __FUNCTION__ = "krb5_finalize"
#30 0x0000003301c03aac in tevent_common_check_signal (ev=0xa53480) at tevent_signal.c:343
        ofs = 0
        j = <value optimized out>
        se = 0xa73c30
        count = 1
        sl = <value optimized out>
        next = 0xa54410
        counter = {count = <value optimized out>, seen = 0}
        clear_processed_siginfo = <value optimized out>
        i = <value optimized out>
#31 0x0000003301c052f7 in std_event_loop_once (ev=0xa53480, location=<value optimized out>) at tevent_standard.c:528
        std_ev = 0xa53540
        tval = {tv_sec = 0, tv_usec = 0}
#32 0x0000003301c026d0 in _tevent_loop_once (ev=0xa53480, location=0x4446b5 "src/util/server.c:526") at tevent.c:490
        ret = <value optimized out>
        nesting_stack_ptr = 0x0
#33 0x0000003301c0273b in tevent_common_loop_wait (ev=0xa53480, location=0x4446b5 "src/util/server.c:526") at tevent.c:591
        ret = <value optimized out>
#34 0x0000000000436111 in server_loop (main_ctx=0xa545f0) at src/util/server.c:526
No locals.
#35 0x000000000040eeab in main (argc=6, argv=<value optimized out>) at src/providers/data_provider_be.c:1333
        opt = <value optimized out>
        pc = <value optimized out>
        be_domain = 0xa52490 "no.ep.corp.local"
        srv_name = <value optimized out>
        conf_entry = <value optimized out>
        main_ctx = 0xa545f0
        ret = 0
        long_options = {{longName = 0x0, shortName = 0 '\000', argInfo = 4, arg = 0x64ae40, val = 0, descrip = 0x43b132 "Help options:", argDescrip = 0x0}, {longName = 0x43b140 "debug-level", shortName = 100 'd', argInfo = 2,
            arg = 0x64af20, val = 0, descrip = 0x43b111 "Debug level", argDescrip = 0x0}, {longName = 0x43b14c "debug-to-files", shortName = 102 'f', argInfo = 0, arg = 0x64af24, val = 0,
            descrip = 0x43bda8 "Send the debug output to files instead of stderr", argDescrip = 0x0}, {longName = 0x43b15b "debug-timestamps", shortName = 0 '\000', argInfo = 2, arg = 0x64ae00, val = 0,
            descrip = 0x43b11d "Add debug timestamps", argDescrip = 0x0}, {longName = 0x43c720 "domain", shortName = 0 '\000', argInfo = 1, arg = 0x7fff5ca923f8, val = 0,
            descrip = 0x43bde0 "Domain of the information provider (mandatory)", argDescrip = 0x0}, {longName = 0x0, shortName = 0 '\000', argInfo = 0, arg = 0x0, val = 0, descrip = 0x0, argDescrip = 0x0}}
        __FUNCTION__ = "main"

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.8.1 (LTM)
owner: somebody => sgallagh
patch: 0 => 1
status: new => assigned

Fixed by:
- c0828b2 (master)
- 39b8393 (sssd-1-8)

resolution: => fixed
status: assigned => closed

Metadata Update from @sgallagh:
- Issue assigned to sgallagh
- Issue set to the milestone: SSSD 1.8.1 (LTM)

3 years ago

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in SSSD's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/2268

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the subscribe button.

Thank you for understanding. We apologize for any inconvenience.
