#4004 Active Directory Domain Controller failure issue
Closed: wontfix a year ago by pbrezina. Opened 2 years ago by woodylinux.

I have seen this issue and recreated it in my lab.

2x Windows AD DCs (DC1 & DC2), both DNS servers and GCs and a simple domain (linuxtest.local), using Kerberos authentication.

I'm using CentOS 7 for my test, on the same subnet, with both DCs listed in resolv.conf; pinging and DNS resolution work fine.

I yum-installed realmd and krb5-workstation along with their required dependencies.

I used realm join to join the domain, and everything appears to work fine. However, if one of the two DCs goes offline, authentication with SSSD can fail. As far as I can tell SSSD should fail over to the remaining DC, but it doesn't.

sssd.conf:

[sssd]
domains = linuxtest.local
config_file_version = 2
services = nss, pam

[domain/linuxtest.local]
ad_domain = linuxtest.local
krb5_realm = LINUXTEST.LOCAL
realmd_tags = manages-system joined-with-adcli
cache_credentials = False
id_provider = ad
krb5_store_password_if_offline = False
default_shell = /bin/bash
ldap_id_mapping = True
use_fully_qualified_names = True
fallback_homedir = /home/%u@%d
access_provider = ad
debug_level = 6

krb5.conf:

# Configuration snippets may be placed in this directory as well
includedir /etc/krb5.conf.d/

includedir /var/lib/sss/pubconf/krb5.include.d/
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 dns_lookup_realm = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
 rdns = false
 pkinit_anchors = /etc/pki/tls/certs/ca-bundle.crt
# default_realm = EXAMPLE.COM
 default_ccache_name = KEYRING:persistent:%{uid}

 default_realm = LINUXTEST.LOCAL
[realms]
# EXAMPLE.COM = {
#  kdc = kerberos.example.com
#  admin_server = kerberos.example.com
# }

 LINUXTEST.LOCAL = {
 }

[domain_realm]
# .example.com = EXAMPLE.COM
# example.com = EXAMPLE.COM
 linuxtest.local = LINUXTEST.LOCAL
 .linuxtest.local = LINUXTEST.LOCAL

SSSD log section:

(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [be_ptask_execute] (0x0400): Back end is offline
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [be_ptask_execute] (0x0400): Task [Check if online (periodic)]: executing task, timeout 60 seconds
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'AD_GC' as 'neutral'
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'neutral'
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'AD' as 'neutral'
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'neutral'
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [dp_attach_req] (0x0400): DP Request [Online Check #48]: New request. Flags [0000].
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [dp_attach_req] (0x0400): Number of active DP request: 1
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [resolve_srv_send] (0x0200): The status of SRV lookup is neutral
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [ad_srv_plugin_send] (0x0400): About to find domain controllers
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [ad_get_dc_servers_send] (0x0400): Looking up domain controllers in domain linuxtest.local and site Default-First-Site-Name
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [resolv_discover_srv_next_domain] (0x0400): SRV resolution of service 'ldap'. Will use DNS discovery domain 'Default-First-Site-Name._sites.linuxtest.local'
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.Default-First-Site-Name._sites.linuxtest.local'
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [be_ptask_done] (0x0400): Task [Check if online (periodic)]: finished successfully
(Tue May  7 21:12:18 2019) [sssd[be[linuxtest.local]]] [be_ptask_schedule] (0x0400): Task [Check if online (periodic)]: scheduling task 1920 seconds from last execution time [1557261858]
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [request_watch_destructor] (0x0400): Deleting request watch
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [fo_discover_srv_done] (0x0400): Got answer. Processing...
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [fo_discover_srv_done] (0x0400): Got 2 servers
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [ad_get_dc_servers_done] (0x0400): Found 2 domain controllers in domain Default-First-Site-Name._sites.linuxtest.local
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [ad_srv_plugin_dcs_done] (0x0400): About to locate suitable site
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [sdap_connect_host_send] (0x0400): Resolving host DC1.linuxtest.local
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'DC1.linuxtest.local' in files
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'DC1.linuxtest.local' in files
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Tue May  7 21:12:20 2019) [sssd[be[linuxtest.local]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'DC1.linuxtest.local' in DNS
(Tue May  7 21:12:22 2019) [sssd[be[linuxtest.local]]] [request_watch_destructor] (0x0400): Deleting request watch
(Tue May  7 21:12:22 2019) [sssd[be[linuxtest.local]]] [sdap_connect_host_resolv_done] (0x0400): Connecting to ldap://DC1.linuxtest.local:389
(Tue May  7 21:12:22 2019) [sssd[be[linuxtest.local]]] [sssd_async_socket_init_send] (0x0400): Setting 6 seconds timeout for connecting
(Tue May  7 21:12:24 2019) [sssd[be[linuxtest.local]]] [fo_resolve_service_timeout] (0x0080): Service resolving timeout reached
(Tue May  7 21:12:24 2019) [sssd[be[linuxtest.local]]] [sssd_async_socket_state_destructor] (0x0400): closing socket [26]
(Tue May  7 21:12:24 2019) [sssd[be[linuxtest.local]]] [dp_req_done] (0x0400): DP Request [Online Check #48]: Request handler finished [0]: Success
(Tue May  7 21:12:24 2019) [sssd[be[linuxtest.local]]] [_dp_req_recv] (0x0400): DP Request [Online Check #48]: Receiving request data.
(Tue May  7 21:12:24 2019) [sssd[be[linuxtest.local]]] [dp_req_destructor] (0x0400): DP Request [Online Check #48]: Request removed.
(Tue May  7 21:12:24 2019) [sssd[be[linuxtest.local]]] [dp_req_destructor] (0x0400): Number of active DP request: 0
(Tue May  7 21:12:24 2019) [sssd[be[linuxtest.local]]] [be_check_online_done] (0x0400): Backend is offline

To me it appears that the DNS query returns both DCs, but only the first is ever tried.
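To illustrate, here is a toy model of the failover behaviour as I understand it from the logs. This is not SSSD's actual code; the server names, costs, and timings are made up for illustration:

```python
# Toy model: an overall service-resolution timeout can expire while the
# first (dead) server is still consuming its own connect timeout, so the
# second, healthy server is never attempted.

def try_servers(servers, connect_cost, resolve_timeout=6):
    """Return the first reachable server, or None if the overall
    service-resolution timeout expires first.  connect_cost maps a
    server name to (seconds_spent, reachable)."""
    elapsed = 0
    for server in servers:
        spent, reachable = connect_cost[server]
        elapsed += spent
        if elapsed >= resolve_timeout:
            return None          # "Service resolving timeout reached"
        if reachable:
            return server
    return None

# DC1 is down: its connect attempt burns a full 6 seconds, so DC2
# (which would answer almost instantly) is never reached.
cost = {"DC1": (6, False), "DC2": (0.1, True)}
assert try_servers(["DC1", "DC2"], cost) is None
# With a larger overall timeout the failover to DC2 succeeds.
assert try_servers(["DC1", "DC2"], cost, resolve_timeout=10) == "DC2"
```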

Can anyone shed any light on this? I think it's a bug.

sssd.x86_64 1.16.2-13.el7_6.8


Hi,

Please try setting 'dns_resolver_timeout' in the [domain/...] section of sssd.conf to a value higher than the default of 6 seconds. It looks like DNS is slower as well with the disabled DC; I guess the disabled one is listed first in resolv.conf.
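The suggested change would look like this in sssd.conf (the value of 10 is only an example; 'dns_resolver_timeout' is documented in sssd.conf(5)):

```ini
[domain/linuxtest.local]
# ... existing options ...
dns_resolver_timeout = 10
```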

HTH

bye,
Sumit

Hi Sumit,

Yes, it's the first DNS server in the list.

I set 'dns_resolver_timeout' to 10 seconds and this appears to have helped. I'll use 12 going forward to ensure it doesn't time out when things are busy.

Could the default of 6 seconds be increased? This is a fairly common scenario in the real world and will trip up many people.

Thanks,

With the default value we try to strike a balance between allowing sufficient time for a service to reply and detecting early that a service is unavailable so that the backend can switch into offline mode. Increasing the default would mean there might be cases where SSSD needs much more time.

Additionally, there are multiple timeouts involved which should be aligned as well; they currently are not, see https://pagure.io/SSSD/sssd/issue/3217.
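For reference, some of the related timeouts can at least be tuned consistently in the [domain/...] section. The option names below come from sssd.conf(5); the values are purely illustrative, not recommendations:

```ini
[domain/linuxtest.local]
# All values in seconds; illustrative only.
dns_resolver_timeout = 10    # overall budget for resolving DC SRV/host records
ldap_network_timeout = 6     # connect() timeout towards a DC
ldap_opt_timeout = 8         # timeout for synchronous LDAP operations
```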

Finally, as you can see in the logs you have sent, there is a 2s timeout in two places: first when resolving the SRV record, and second when looking up the DC itself. This 2s timeout is currently not configurable and is defined as

#define RESOLV_TIMEOUTMS  2000

I'm wondering if it would be possible to tell our resolver library, c-ares, to move DNS servers which had a timeout to the end of the server list, instead of running over a static list of servers again and again. @jhrozek, do you have any idea?

bye,
Sumit

Not really. Ares keeps track of the server state, but only within a single query. There is a flag, STAYOPEN, which keeps the connections to the DNS servers open even after the query finishes and as a result remembers the state. But then you need to close the connection yourself at some point. There is no "reset state" public function except for the ones that either reset the whole channel (the ares context) or reset the servers.

Maybe it would be possible to keep the connection open for the duration of a failover request and reset it afterwards.

Does the rotate option in resolv.conf impact this issue?

It seems to me that the rotate option is orthogonal, so if you set it, you might lower the chance that you hit the server that does not respond, but if you do, the issue would remain the same, no?


My testing supports that, yes. It's not a fix, but it allows logon some of the time. Less than ideal.
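That hit-and-miss behaviour can be sketched with another toy model (again not resolver code; names and the per-query success rule are made up): rotate only round-robins which server a query starts with, so queries that still land on the dead DC first fail exactly as before.

```python
# Toy model: 'rotate' changes the starting server per query, so some
# queries dodge the dead DC entirely, but any query that does start at
# it still hits the same timeout as without rotate.

def query_starting_at(server, up):
    """A query succeeds immediately iff its first-choice server is up
    (the failing case is what matters for this issue)."""
    return up[server]

def run_queries(n, rotate, up, order=("DC1", "DC2")):
    results = []
    for i in range(n):
        first = order[i % len(order)] if rotate else order[0]
        results.append(query_starting_at(first, up))
    return results

up = {"DC1": False, "DC2": True}        # DC1 is the dead DC
# Without rotate, every query starts at the dead DC and fails.
assert run_queries(4, rotate=False, up=up) == [False] * 4
# With rotate, roughly every other query succeeds: "logon some of the time".
assert run_queries(4, rotate=True, up=up) == [False, True, False, True]
```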

Metadata Update from @pbrezina:
- Issue tagged with: Canditate to close

a year ago

Thank you for taking the time to submit this request for SSSD. Unfortunately this issue was not given priority and the team lacks the capacity to work on it at this time.

Given that we are unable to fulfill this request, I am closing the issue as wontfix.

If the issue still persists on a recent SSSD, you can request reconsideration of this decision by reopening this issue. Please provide additional technical details about its importance to you.

Thank you for understanding.

Metadata Update from @pbrezina:
- Issue close_status updated to: wontfix
- Issue status updated to: Closed (was: Open)

a year ago

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in SSSD's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/4975

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.
