#2904 sssd_be AD segfaults on missing A record
Closed: Fixed. Opened 5 years ago by lukebigum.

When using sssd on CentOS 6.6 with the AD backend against a Samba 4.2 AD domain, sssd does not handle a rare failure condition: the SRV records point at a DC, but the A record for that domain controller is missing. sssd_be periodically crashes; it restarts a couple of times but generally does not recover:

Dec 16 09:44:05 host kernel: sssd_be[107682]: segfault at 0 ip 00007fd12c5e018b sp 00007fffba8db420 error 4 in libsss_ad.so[7fd12c5ca000+20000]
Dec 16 09:44:05 host abrtd: Directory 'ccpp-2015-12-16-09:44:05-107682' creation detected
Dec 16 09:44:05 host abrt[107687]: Saved core dump of pid 107682 (/usr/libexec/sssd/sssd_be) to /var/spool/abrt/ccpp-2015-12-16-09:44:05-107682 (1978368 bytes)

The SIGSEGV appears to happen in sss_ldap_init_send(), src/util/sss_ldap.c:331.

Getting into this condition is rare - it's a Samba bug that I'm working on separately. The situation could probably be replicated by poisoning DNS, though. The expected behavior would be to give up on this DC, try any other DCs in the site, then try DCs in other sites.

I have ABRT crashes and cores / backtraces from GDB.

I'm unable to attach the GDB backtraces and cores, so you can download a compressed tarball of it here if you want it (https://files.lmax.com/rmo325). It will survive there for 20 days.

Some more information:


Will attach the conf and log file.

Thank you for the bug report. Is there a way you can test with more recent packages? 6.6 is quite old.

Either 6.7 or Lukas' test repo: https://copr.fedoraproject.org/coprs/lslebodn/sssd-1-12/ or the 6.8 preview repo: https://copr-fe.cloud.fedoraproject.org/coprs/jhrozek/SSSD-6.8-preview/

That's probably doable; I'm half expecting the Samba server to delete its own A record at exactly 4pm today, so there's a good chance I'll have an opportunity to try it out. I've synced down your SSSD-6.8-preview repo and will let you know how it goes.

As expected, Samba deleted its own A record at 4pm. I've got this sssd version on the VM: sssd-1.13.2-7.el6.unsupported_preview.x86_64, and it exhibits the exact same symptoms, right down to the same line of code:

#0 sss_ldap_init_send (mem_ctx=<value optimized out>, ev=0x128c560, uri=0x12e39a0 "(null)", addr=0x0, addr_len=128, timeout=6) at src/util/sss_ldap.c:349
        ret = 0
        req = 0x12db5e0
        state = 0x12e43f0
        __FUNCTION__ = "sss_ldap_init_send"
        subreq = 0x12e39a0
        tv = {tv_sec = 19775120, tv_usec = 104}

You can download a GDB core dump and trace from here: https://files.lmax.com/mnfa5p. I will also upload an ABRT crash report that contains a cut-down core.

Thank you very much for testing, then I think this is something we should fix in the next upstream version.

Moving into 1.13.4 as per Dec-17 ticket triage.

milestone: NEEDS_TRIAGE => SSSD 1.13.4

Fields changed

owner: somebody => pbrezina
status: new => assigned

Hi, unfortunately the ABRT report does not contain sssd logs for some reason. Can you also attach complete logs (/var/log/sssd) with debug level set to 0x3ff0, please? Thanks.

The fix is relatively trivial, but I'd also like to see what is happening there so I can choose the proper place to fix it.

Log file added. The failure condition can be replicated easily enough with dnsmasq by sending DNS requests for the DC to nowhere:

dnsmasq --server=/dc.example.com/
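Spelling the reproduction out a little more (hostnames are placeholders; the point is that the AD domain's SRV records must still resolve normally while only the DC host's A record is blackholed), a dnsmasq configuration along these lines should work:

```
# /etc/dnsmasq.conf -- hypothetical example
server=192.0.2.53            # normal upstream DNS for everything else
server=/dc1.example.com/     # empty server list: dnsmasq answers queries
                             # for the DC's name locally (NXDOMAIN unless
                             # it is in /etc/hosts), so the A record is gone

# Then point the client's /etc/resolv.conf at dnsmasq:
# nameserver 127.0.0.1
```

In dnsmasq, server=/domain/ with nothing after the trailing slash marks the domain as local-only, which is what makes the A record lookup fail while SRV lookups for the parent domain still succeed upstream.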

I've just realised that log is not complete; it doesn't have any backend logging, which is probably what you want (the AD backend logs)?

Hi, unfortunately I am not able to reproduce the issue with current master or with 1.11, neither by using dnsmasq nor by deleting the A record from DNS. Yes, sssd.log is not sufficient; I meant all the logs in /var/log/sssd/*, and as you correctly guessed I am especially interested in sssd_$domain.log.

This is really frustrating... 24 hours ago I could get sssd_be to segfault with dnsmasq, as per the last attached log; now I can't. There must be some other condition, in addition to the missing A record, that causes this failure, one my environment no longer has, and I don't know what it is in order to trigger it again.

At this point in time I can't get you the logs you want. What is happening to me now is that sssd_be fails to resolve the primary site's DC and then goes looking for other backup DCs (as you'd expect). Auth still doesn't work against a backup DC for some other reason, but it's not crashing any more.

I'm not sure if you want to try a blind fix based on the core dump, or close this now.

We'll fix it without the logs somehow, but I'd like to know what situation occurred. If you manage to obtain the logs after all, please send them.

Hi, I think I see the code area where the bug lies, but I can't identify the exact location without the logs or a reproducer. I sent a patch that prevents the segfault to the list; if you manage to get those logs, please attach them here. Thanks.

patch: 0 => 1

resolution: => fixed
status: assigned => closed

Metadata Update from @lukebigum:
- Issue assigned to pbrezina
- Issue set to the milestone: SSSD 1.13.4

4 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3945

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.
