I have seen this issue and recreated it in my lab.
Two Windows AD DCs (DC1 & DC2), both DNS servers and GCs, in a simple domain (linuxtest.local) using Kerberos authentication.
I used CentOS 7 for my test, on the same subnet, with both DCs listed in resolv.conf; pinging and DNS resolution work fine.
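For reference, the resolver setup described above would look roughly like this (the addresses are made up for illustration; only the ordering matters for this issue):

```
# /etc/resolv.conf (example addresses)
search linuxtest.local
nameserver 192.168.1.10   # DC1 - if this one is down, it is still tried first
nameserver 192.168.1.11   # DC2
```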
I installed realmd and krb5-workstation (with their dependencies) via yum.
I used realm join to join the domain and everything appears to work fine. However, if one of the two DCs goes offline, authentication through SSSD can fail. As far as I can tell it should fail over to the remaining DC, but it doesn't.
sssd.conf:
[sssd]
domains = linuxtest.local
config_file_version = 2
services = nss, pam

[domain/linuxtest.local]
ad_domain = linuxtest.local
krb5_realm = LINUXTEST.LOCAL
realmd_tags = manages-system joined-with-adcli
cache_credentials = False
id_provider = ad
krb5_store_password_if_offline = False
default_shell = /bin/bash
ldap_id_mapping = True
use_fully_qualified_names = True
fallback_homedir = /home/%u@%d
access_provider = ad
debug_level = 6
krb5.conf:
# Configuration snippets may be placed in this directory as well
includedir /etc/krb5.conf.d/
includedir /var/lib/sss/pubconf/krb5.include.d/

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 dns_lookup_realm = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
 rdns = false
 pkinit_anchors = /etc/pki/tls/certs/ca-bundle.crt
# default_realm = EXAMPLE.COM
 default_ccache_name = KEYRING:persistent:%{uid}
 default_realm = LINUXTEST.LOCAL

[realms]
# EXAMPLE.COM = {
#  kdc = kerberos.example.com
#  admin_server = kerberos.example.com
# }
 LINUXTEST.LOCAL = {
 }

[domain_realm]
# .example.com = EXAMPLE.COM
# example.com = EXAMPLE.COM
 linuxtest.local = LINUXTEST.LOCAL
 .linuxtest.local = LINUXTEST.LOCAL
SSSD log section:
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [be_ptask_execute] (0x0400): Back end is offline
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [be_ptask_execute] (0x0400): Task [Check if online (periodic)]: executing task, timeout 60 seconds
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'AD_GC' as 'neutral'
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'neutral'
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'AD' as 'neutral'
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'neutral'
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [dp_attach_req] (0x0400): DP Request [Online Check #48]: New request. Flags [0000].
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [dp_attach_req] (0x0400): Number of active DP request: 1
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [resolve_srv_send] (0x0200): The status of SRV lookup is neutral
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [ad_srv_plugin_send] (0x0400): About to find domain controllers
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [ad_get_dc_servers_send] (0x0400): Looking up domain controllers in domain linuxtest.local and site Default-First-Site-Name
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [resolv_discover_srv_next_domain] (0x0400): SRV resolution of service 'ldap'. Will use DNS discovery domain 'Default-First-Site-Name._sites.linuxtest.local'
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.Default-First-Site-Name._sites.linuxtest.local'
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [be_ptask_done] (0x0400): Task [Check if online (periodic)]: finished successfully
(Tue May 7 21:12:18 2019) [sssd[be[linuxtest.local]]] [be_ptask_schedule] (0x0400): Task [Check if online (periodic)]: scheduling task 1920 seconds from last execution time [1557261858]
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [request_watch_destructor] (0x0400): Deleting request watch
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [fo_discover_srv_done] (0x0400): Got answer. Processing...
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [fo_discover_srv_done] (0x0400): Got 2 servers
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [ad_get_dc_servers_done] (0x0400): Found 2 domain controllers in domain Default-First-Site-Name._sites.linuxtest.local
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [ad_srv_plugin_dcs_done] (0x0400): About to locate suitable site
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [sdap_connect_host_send] (0x0400): Resolving host DC1.linuxtest.local
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'DC1.linuxtest.local' in files
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'DC1.linuxtest.local' in files
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Tue May 7 21:12:20 2019) [sssd[be[linuxtest.local]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'DC1.linuxtest.local' in DNS
(Tue May 7 21:12:22 2019) [sssd[be[linuxtest.local]]] [request_watch_destructor] (0x0400): Deleting request watch
(Tue May 7 21:12:22 2019) [sssd[be[linuxtest.local]]] [sdap_connect_host_resolv_done] (0x0400): Connecting to ldap://DC1.linuxtest.local:389
(Tue May 7 21:12:22 2019) [sssd[be[linuxtest.local]]] [sssd_async_socket_init_send] (0x0400): Setting 6 seconds timeout for connecting
(Tue May 7 21:12:24 2019) [sssd[be[linuxtest.local]]] [fo_resolve_service_timeout] (0x0080): Service resolving timeout reached
(Tue May 7 21:12:24 2019) [sssd[be[linuxtest.local]]] [sssd_async_socket_state_destructor] (0x0400): closing socket [26]
(Tue May 7 21:12:24 2019) [sssd[be[linuxtest.local]]] [dp_req_done] (0x0400): DP Request [Online Check #48]: Request handler finished [0]: Success
(Tue May 7 21:12:24 2019) [sssd[be[linuxtest.local]]] [_dp_req_recv] (0x0400): DP Request [Online Check #48]: Receiving request data.
(Tue May 7 21:12:24 2019) [sssd[be[linuxtest.local]]] [dp_req_destructor] (0x0400): DP Request [Online Check #48]: Request removed.
(Tue May 7 21:12:24 2019) [sssd[be[linuxtest.local]]] [dp_req_destructor] (0x0400): Number of active DP request: 0
(Tue May 7 21:12:24 2019) [sssd[be[linuxtest.local]]] [be_check_online_done] (0x0400): Backend is offline
To me it appears that the DNS query returns both DCs, but only the first is tried.
Can anyone shed any light on this? I think it's a bug.
sssd.x86_64 1.16.2-13.el7_6.8
Hi,
Please try setting 'dns_resolver_timeout' in the [domain/...] section of sssd.conf to a value higher than the default of 6 seconds. It looks like DNS is slower as well with the disabled DC; I guess the disabled one is the first in resolv.conf.
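The suggested change would look like the following in sssd.conf (the value 10 is just an example; pick whatever suits your environment):

```ini
# /etc/sssd/sssd.conf - only the relevant lines shown
[domain/linuxtest.local]
# Default is 6 seconds; raise it so a dead first nameserver
# does not exhaust the whole resolution window.
dns_resolver_timeout = 10
```

Remember to restart sssd after editing the file so the new value takes effect.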
HTH
bye, Sumit
Hi Sumit,
Yes, it's the first DNS server in the list.
I set 'dns_resolver_timeout' to 10 seconds and this appears to have helped; I'll set 12 going forward to ensure it doesn't time out if things are busy.
Could the default of 6 seconds be increased? This is a fairly common scenario in the real world and will trip up many people.
Thanks,
With the default value we try to strike a balance between allowing sufficient time for a service to reply and detecting early that the service is not available, so that the backend can switch into offline mode. Increasing the default would mean there might be cases where SSSD needs much more time.
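The trade-off described above can be sketched with a toy model (this is illustrative Python, not SSSD code; the function name and the sequential-failover assumption are mine):

```python
def time_to_give_up(server_states, timeout_s):
    """Worst-case seconds before the backend concludes it is offline,
    assuming servers are tried one after another and each dead server
    costs the full per-server timeout.

    server_states: list of booleans, True = server responds promptly.
    """
    total = 0.0
    for alive in server_states:
        if alive:
            return total      # a working server answered; we stay online
        total += timeout_s    # dead server: burn the whole timeout
    return total              # all servers failed: go offline after `total` s

# Two DCs, first one down: the default 6 s means ~6 s before the second
# DC is even tried. Raising the timeout to 12 s doubles that wait, and if
# both DCs are down, offline detection takes 2 * timeout seconds.
```

This is why simply raising the default helps the "first server is dead" case but makes the "all servers are dead" case slower to detect.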
Additionally, there are multiple timeouts involved which should be aligned but currently are not; see https://pagure.io/SSSD/sssd/issue/3217.
Finally, as you can see in the logs you have sent, there is a 2s timeout in two places: first when resolving the SRV record, and second when looking up the DC itself. This 2s timeout is currently not configurable and is defined as
#define RESOLV_TIMEOUTMS 2000
I'm wondering if it would be possible to tell our resolver library, c-ares, to move DNS servers which timed out to the end of the list instead of running over a static list of servers again and again. @jhrozek, do you have any idea?
Not really. Ares keeps track of the server state, but only within a single query. There is a STAYOPEN flag which keeps the connections to the DNS servers open even after the query finishes and as a result remembers the state, but then you need to close the connections yourself at some point. There is no public "reset state" function except for the ones that reset the whole channel (the ares context) or reset the servers.
Maybe it would be possible to keep the connection open for the duration of a failover request and reset it afterwards.
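The reordering idea floated above (demote timed-out nameservers to the end of the list) could be sketched like this; again, this is a standalone illustration, not c-ares or SSSD code:

```python
def demote_failed(servers, failed):
    """Return `servers` reordered so that every server in `failed`
    (e.g. one that timed out during the last failover pass) comes
    last, preserving the relative order within each group."""
    ok = [s for s in servers if s not in failed]
    bad = [s for s in servers if s in failed]
    return ok + bad

# After DC1 times out, the next pass would start with DC2 instead of
# hammering the dead server again:
#   demote_failed(["dc1", "dc2"], {"dc1"}) -> ["dc2", "dc1"]
```

As noted in the reply above, implementing this on top of c-ares is the hard part, since the library only tracks per-server state within a single query.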
Does the rotate option in resolv.conf impact this issue?
It seems to me that the rotate option is orthogonal, so if you set it, you might lower the chance that you hit the server that does not respond, but if you do, the issue would remain the same, no?
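For anyone wanting to test this, the rotate option is set in resolv.conf like so (addresses are again made up for illustration):

```
# /etc/resolv.conf
options rotate            # round-robin across the listed nameservers
nameserver 192.168.1.10   # DC1
nameserver 192.168.1.11   # DC2
```

As noted above, this only changes which server is tried first per query; it does not make the resolver remember that a server is dead.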
My testing would support that, yes. It is not a fix, but it allows logon some of the time. Less than ideal.
Metadata Update from @pbrezina: - Issue tagged with: Candidate to close
Thank you for taking time to submit this request for SSSD. Unfortunately this issue was not given priority and the team lacks the capacity to work on it at this time.
Given that we are unable to fulfill this request I am closing the issue as wontfix.
If the issue still persists on a recent SSSD, you can request reconsideration of this decision by reopening this issue. Please provide additional technical details about its importance to you.
Thank you for understanding.
Metadata Update from @pbrezina: - Issue close_status updated to: wontfix - Issue status updated to: Closed (was: Open)
SSSD is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in SSSD's GitHub repository.
This issue has been cloned to GitHub and is available here: - https://github.com/SSSD/sssd/issues/4975
If you want to receive further updates on the issue, please navigate to the GitHub issue and click the Subscribe button.
Thank you for understanding. We apologize for any inconvenience.