#1037 SIGSEGV in sssd_be
Closed: Fixed. Opened 7 years ago by prefect.

sssd is configured against Active Directory.

sssd_be crashed dumping core:

Core was generated by `/usr/libexec/sssd/sssd_be --domain default --debug-to-files'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000411a93 in fo_set_port_status (server=0x50eef600, 
    status=PORT_WORKING) at src/providers/fail_over.c:1332
1332            if (!siter->common || !siter->common->name) continue;

(gdb) list
1327        /* It is possible to introduce duplicates when expanding SRV results
1328         * into fo_server structures. Find the duplicates and set the same
1329         * status */
1330        DLIST_FOR_EACH(siter, server->service->server_list) {
1331            if (siter == server) continue;
1332            if (!siter->common || !siter->common->name) continue;
1333
1334            if (siter->port == server->port &&
1335                (strcasecmp(siter->common->name, server->common->name) == 0)) {
1336                DEBUG(7, ("Marking port %d of duplicate server '%s' as '%s'\n",

(gdb) print siter->common
$1 = (struct server_common *) 0xa0
(gdb) print siter->common->name
Cannot access memory at address 0xc0

/var/log/secure:

Oct  7 04:26:01 blah crond[6022]: pam_sss(crond:account): Request to sssd failed. Timer expired

core file has been retained (but is large - 1.3 GB), and an sssd_default.log is available at log level 9.


Easier-to-read log from gdb:
gdb-log

Fields changed

component: SSSD => Data Provider
priority: major => critical

Sorry for not asking this sooner, but do you still have SSSD logs from when the bug happened? It would be very helpful to see what name resolution SSSD performed, etc.

Also, if you still have the core file, can you examine some data structures for me, please?

I would like to see the following from inside the fo_set_port_status() function:

print server->service->ctx
print *server->service->ctx
print *server->service->ctx->server_common_list

Thank you!

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.7.0
priority: critical => blocker

Fields changed

owner: somebody => jzeleny

Besides the information jhrozek asked for earlier, I'd also greatly appreciate a reproducer, i.e. a sanitized config file and the steps you had to perform to induce this segfault. I'd like a core file of my own so I could inspect the code in detail.

Thanks
Jan

There has been no activity for some time in this ticket. I'd like to ask you once more for the additional information we requested. If no more info is provided, I'll close the ticket as worksforme.

Replying to [comment:7 jzeleny]:

Sorry for not getting back to you; I hadn't seen the movement on this ticket. I've still got the sssd logs and the core dumps, but not the matching build of 1.6.1 I had installed at the time, so I'm not sure of their value. I've also not got the matching /var/log/secure, which makes lining up the timings of when things went wrong with the 4.4 GB sssd_default.log a little fun.

I upgraded to 1.6.3 and have not seen this problem again. I've left in place a script that monitors the logs for this failure, so should be able to catch it again if it happens in future. Before it was happening every week or two on a heavily loaded system, so it should crop up again soon enough if the problem's not fixed.

I have had crashes of sssd_be since, but they've all recovered gracefully.

jh

Replying to [comment:3 jhrozek]:

> Also, if you still have the core file, can you examine some data structures for me, please?
>
> I would like to see the following from inside the fo_set_port_status() function:
>
> print server->service->ctx
> print *server->service->ctx
> print *server->service->ctx->server_common_list
>
> Thank you!

Actually, I seem to have an instance or three of this crash against 1.6.3 built straight from git, so maybe I shouldn't write off this bug yet. Log level is 0, unfortunately, so I have nothing there.

Core was generated by `/usr/libexec/sssd/sssd_be --domain default --debug-to-files'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000411c03 in fo_set_port_status (server=0x21c5420, 
    status=PORT_WORKING) at src/providers/fail_over.c:1332
1332            if (!siter->common || !siter->common->name) continue;

(gdb) print server->service->ctx
$1 = (struct fo_ctx *) 0x1ff4120
(gdb) print *server->service->ctx
$2 = {service_list = 0x2018270, server_common_list = 0x21e3ec0, opts = 0x1ff75a0}
(gdb) print *server->service->ctx->server_common_list
$3 = {DO_NOT_TOUCH_THIS_MEMBER_refcount = 5, ctx = 0x1ff4120, prev = 0x0, 
    next = 0x21e40f0, name = 0x21e3f70 "az24.qa.fails.co.zn", 
    rhostent = 0x2039c00, request_list = 0x0, server_status = 3, 
    last_status_change = {tv_sec = 1321099319, tv_usec = 18606}}

I'll bob the config on in a minute.

version: 1.6.1 => 1.6.3

I don't have a reliable reproducer, unfortunately, and there's no obvious pattern. The machine sits in service with a reasonable number of users coming in and out over ssh. Over the last month (a mix of the old 1.6.1 and the newer 1.6.3), sssd_be has crashed 9 times. What log level would be useful?

sssd.conf:

[sssd]
config_file_version = 2
reconnection_retries = 3
sbus_timeout = 30
services = nss, pam
domains = default

[nss]
filter_groups = root
filter_users = root
reconnection_retries = 3

[pam]
reconnection_retries = 3

[domain/default]
lookup_family_order=ipv4_only
auth_provider = krb5
cache_credentials = false
krb5_realm = EXAMPLE.COM
chpass_provider = krb5
id_provider = ldap
dns_discovery_domain = EXAMPLE.COM
krb5_validate = true
krb5_renew_interval = 300
min_id = 100
access_provider = simple
simple_allow_groups = a_group
enumerate = false

ldap_force_upper_case_realm = True
ldap_schema = rfc2307bis
ldap_referrals = false
ldap_search_base = dc=example,dc=com
ldap_sasl_mech = gssapi
ldap_pwd_policy = none

ldap_user_object_class = user
ldap_user_name = sAMAccountName
ldap_user_uid_number = msSFU30UidNumber
ldap_user_gid_number = primaryGroupID
ldap_user_gecos = displayName
ldap_user_home_directory = msSFU30HomeDirectory
ldap_user_shell = msSFU30LoginShell
ldap_user_principal = userPrincipalName

ldap_group_object_class = group
ldap_group_name = cn
ldap_group_gid_number = msSFU30GidNumber
ldap_group_search_base = ou=blah,ou=blah,dc=example,dc=com

Fields changed

status: new => assigned

I'm going to close this one, as the patch which probably fixes this has been pushed to master. Please feel free to reopen if the error persists on your system.

Fixed in: d4d9091

resolution: => fixed
status: assigned => closed

Fields changed

rhbz: => 0

Metadata Update from @prefect:
- Issue assigned to jzeleny
- Issue set to the milestone: SSSD 1.7.0
