#3044 sssd-ad backend does not try all backends after first failure
Closed: Invalid. Opened 7 years ago by lukebigum.

After a recent DC failure at one of our Sites, almost all the SSSD clients (1.13.3-22) did not fail over to a backup Site. To recover them at the time, I forced 'ad_server' and 'krb5_server' to another DC. Now that the DC is recovered and I have time to play around, it looks like it's related to DNS.
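
For reference, the workaround in sssd.conf was along these lines (a minimal sketch; 'dc01.example.com' is a placeholder for the backup Site's DC):

[domain/example.com]
# hypothetical DC in the backup Site
ad_server = dc01.example.com
krb5_server = dc01.example.com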

The Samba 4 AD DC that had failed is also primary DNS for the Site. If I firewall off one client and restart sssd, more often than not I get resolution failures like this, which take the AD backend offline:

(Mon Jun 13 12:19:39 2016) [sssd[be[example.com]]] [ad_online_cb] (0x0400): The AD provider is online
(Mon Jun 13 12:19:45 2016) [sssd[be[example.com]]] [fo_resolve_service_timeout] (0x0080): Service resolving timeout reached
(Mon Jun 13 12:19:45 2016) [sssd[be[example.com]]] [sss_ldap_init_state_destructor] (0x0400): closing socket [24]
(Mon Jun 13 12:19:45 2016) [sssd[be[example.com]]] [be_resolve_server_done] (0x1000): Server resolution failed: 14
(Mon Jun 13 12:19:45 2016) [sssd[be[example.com]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
(Mon Jun 13 12:19:45 2016) [sssd[be[example.com]]] [be_ptask_create] (0x0400): Periodic task [Check if online (periodic)] was created
(Mon Jun 13 12:19:45 2016) [sssd[be[example.com]]] [be_ptask_schedule] (0x0400): Task [Check if online (periodic)]: scheduling task 63 seconds from now [1465820448]
(Mon Jun 13 12:19:45 2016) [sssd[be[example.com]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.

It doesn't happen 100% of the time; sometimes the resolver retries with the next DNS server fast enough that the sssd backend stays up, but mostly it doesn't. For example, in this Site of 450-odd clients, 80-90% could not authenticate until I reconfigured them all to point at another Site's DC. Sometimes the SRV lookup works, and then a subsequent DNS query fails.

I can reproduce the problem with iptables quite easily, so I can generate logs if needed. I'd prefer to send unsanitised logs privately to someone rather than attaching them to a bug report.

The steps you would need to reproduce are:

  • Two Samba 4 or AD DCs, separated into Sites via SRV lookups, with 'ad_enable_dns_sites = true'
  • One client (version 1.13.3-22)
  • BIND / DNS on the active DC of the client's Site
  • Firewall the client off from the active DC of its Site
  • Restart sssd on the client (a minimal command sketch follows this list)
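
A minimal sketch of those last two steps, with 'dc03' standing in for the Site's active DC and 'client' for the test machine (both names hypothetical):

# on the DC, drop all traffic from the client
[root@dc03 ~]# iptables -A INPUT -s client/32 -j DROP
# on the client, restart the daemon so it re-resolves its servers
[root@client1 sssd]# service sssd restart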

Brainstorming some ideas: could SSSD try every DNS server (in src/providers/ad/ad_srv.c) rather than just failing on the first? Or repeatedly try every available DNS server, but with a much larger, global timeout for each DNS lookup group? Or cache the servers that could be resolved and, if a subsequent DNS lookup fails, fall back to the cached copy?
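
Just to illustrate the first idea at the shell level (the desired behaviour, not SSSD code): walk every nameserver in /etc/resolv.conf and stop at the first one that answers the SRV query, rather than giving up when the first one times out:

# try each configured nameserver in turn; break on the first answer
for ns in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
    dig @$ns -t SRV _ldap._tcp.example.com +time=2 +tries=1 && break
done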


I should also mention that there are two other DNS servers available on this test client, and a standard libc DNS lookup works from the command line, so it's not like DNS is completely unavailable to the client:

[root@client1 sssd]# grep nameserver /etc/resolv.conf | wc -l
3
[root@client1 sssd]# host -t SRV _ldap._tcp.example.com
_ldap._tcp.example.com has SRV record 0 100 389 co1dc02.example.com.
_ldap._tcp.example.com has SRV record 0 100 389 ix3dc01.example.com.
_ldap._tcp.example.com has SRV record 0 100 389 co1dc01.example.com.

I was only able to reproduce #3042 by firewalling off one DC, so I now think the two issues are related. I've sanitised an AD backend log file showing the problem. The client in this test has dc03 as its first nameserver in resolv.conf. Steps to reproduce:

1) On dc03: iptables -A INPUT -s client/32 -j DROP
2) On the client: service sssd stop; rm -f /var/lib/sss/db/* /var/log/sssd/*; service sssd start
3) ssh client

The logs show a connection to ldap://dc03.example.com:389 timing out:

(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [set_server_common_status] (0x0100): Marking server 'dc03.example.com' as 'name resolved'
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [be_resolve_server_process] (0x1000): Saving the first resolved server
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [be_resolve_server_process] (0x0200): Found address for server dc03.example.com: [10.103.250.203] TTL 116
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [sdap_uri_callback] (0x0400): Constructed uri 'ldap://dc03.example.com:389'
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [sss_ldap_init_send] (0x0400): Setting 6 seconds timeout for connecting
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [client_registration] (0x0100): Cancel DP ID timeout [0x15c8c40]
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [client_registration] (0x0100): Added Frontend client [NSS]
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [be_get_subdomains] (0x0400): Got get subdomains []
(Wed Jun 15 09:39:05 2016) [sssd[be[example.com]]] [fo_resolve_service_timeout] (0x0080): Service resolving timeout reached
(Wed Jun 15 09:39:05 2016) [sssd[be[example.com]]] [sss_ldap_init_state_destructor] (0x0400): closing socket [23]
(Wed Jun 15 09:39:05 2016) [sssd[be[example.com]]] [be_resolve_server_done] (0x1000): Server resolution failed: 14
(Wed Jun 15 09:39:05 2016) [sssd[be[example.com]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
(Wed Jun 15 09:39:05 2016) [sssd[be[example.com]]] [be_ptask_create] (0x0400): Periodic task [Check if online (periodic)] was created
(Wed Jun 15 09:39:05 2016) [sssd[be[example.com]]] [be_ptask_schedule] (0x0400): Task [Check if online (periodic)]: scheduling task 87 seconds from now [1465983632]
(Wed Jun 15 09:39:05 2016) [sssd[be[example.com]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.

Even though the backend knows there are other servers:

(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [fo_add_server_to_list] (0x0400): Inserted primary server 'dc03.example.com:389' to service 'LDAP'
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [fo_add_server_to_list] (0x0400): Inserted backup server 'dc01.example.com:389' to service 'LDAP'
(Wed Jun 15 09:38:59 2016) [sssd[be[example.com]]] [fo_add_server_to_list] (0x0400): Inserted backup server 'dc02.example.com:389' to service 'LDAP'

I've tested this by just firewalling off DNS, and I get the same result:

iptables -A INPUT -s client/32 -p tcp --dport 53 -j DROP
iptables -A INPUT -s client/32 -p udp --dport 53 -j DROP
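
To confirm the rules actually bite, a quick check from the client (assuming 'dc03.example.com' is the firewalled server); dig should report something like "no servers could be reached":

# query the blocked server directly, with a short timeout and no retries
[root@client1 sssd]# dig @dc03.example.com -t SRV _ldap._tcp.example.com +time=2 +tries=1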

I'll attach the sanitised log.

Attachment: sssd_ad_be.log (AD backend log showing the failures taking the entire backend offline)

Fields changed

summary: sssd-ad backend too susceptible to DNS failures => sssd-ad backend does not try all backends after first failure

Ignore the bits about the DC needing to be the primary nameserver. I thought it was primary DNS, but it's not; it's a completely separate server. So these are not DNS timeouts (or if they are, it's not because the DC is offline). I think I was also confused by the message "Service resolving timeout reached"; it's not a DNS resolver timeout.

So if it is just service timeouts (LDAP / Kerberos), the behaviour appears to be taking the backend offline after the first service failure, and it only becomes healthy again once the internal health-check timers fire and a different DC is attempted. SSSD should be trying every DC it finds, including backup servers, even when there's only one DC for the current Site.
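
As a stop-gap until the failover logic is fixed, the DC list can be pinned explicitly in sssd.conf rather than relying purely on SRV discovery. A sketch using the DC names from the logs above ('ad_backup_server' is the standard companion option to 'ad_server'; whether it actually helps here is untested):

[domain/example.com]
# pin the Site's DC as primary, fall back to the other Sites' DCs
ad_server = dc03.example.com
ad_backup_server = dc01.example.com, dc02.example.com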

Sorry about the delay. To make sure I understand what you're saying: do you mean that SSSD just failed hard on a connection failure despite having more servers for that service to try? If yes, then it sounds a bit like https://fedorahosted.org/sssd/ticket/3010 which we only fixed recently (or a variant of it). Can you test packages for us? You were running 6.8, right?

Yes, you described my issue. My earlier thoughts about DNS turned out to be totally wrong.

#3010 could be it. I've currently got https://jhrozek.fedorapeople.org/sssd-test-builds/sssd-6.8-fd-leak/ on a subset of hosts, so if you don't mind patching on top of that I can try out the patches in #3010. If it has to be 1.14 series, though, that's not really a problem.

I'm sorry about the long delay again. Can you try this build, please?
https://jhrozek.fedorapeople.org/sssd-test-builds/sssd-6.8-ad-failover/

If it doesn't fix your problem, at least we know it's a new bug and we need to fix it separately.

Looks like the test build solved the issue :) If not, please reopen this ticket.

resolution: => worksforme
status: new => closed

Metadata Update from @lukebigum:
- Issue set to the milestone: NEEDS_TRIAGE

7 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/4077

If you want to receive further updates on the issue, please navigate to the github issue
and click on the subscribe button.

Thank you for understanding. We apologize for any inconvenience.

