#1966 SSSD failover doesn't work if the first DNS server in resolv.conf is unavailable
Closed: Fixed · Opened 7 years ago by jhrozek.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 6): Bug 966757

Created attachment 752397
Traces from client and DC of failure and success

Description of problem: When using SRV record failover to integrate SSSD
with Active Directory, everything works fine as long as the first DNS server
listed in resolv.conf is alive and well.

However, if the first DNS server listed in resolv.conf is down, SSSD
black-holes the request.

In a packet trace, the failure shows up as a RST after the downed DNS server
is queried.

This can be problematic in Active Directory, as often, the DNS servers are the
DCs, which in turn are also the LDAP servers. The point of failover is to
survive an outage of an LDAP server, but if that LDAP server is also the first
listed DNS server, SSSD just breaks.

The behavior is similar to that of attempting to use RR DNS for failover, which
is not supported as per the documentation.

SSSD should not fail if there are other viable DNS servers in resolv.conf.
Instead, it should move on to the next DNS server and retry the request.
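The retry behaviour being requested can be sketched as a simple loop over the resolv.conf nameserver list. This is only an illustration of the desired logic, not SSSD's implementation (SSSD does its DNS work through c-ares); `query_fn` is a hypothetical stand-in for the real DNS client call.

```python
# Sketch of the requested failover behaviour: walk the nameservers in
# resolv.conf order and fall through to the next one when a query fails,
# instead of giving up after the first server times out.

class AllServersFailed(Exception):
    """Raised when no nameserver returned an answer."""

def resolve_with_failover(name, nameservers, query_fn):
    errors = []
    for server in nameservers:
        try:
            # query_fn is a placeholder for a real DNS lookup against
            # a single server; it raises on timeout/connection failure.
            return query_fn(name, server)
        except (TimeoutError, ConnectionError) as exc:
            errors.append((server, exc))
            continue  # try the next server in resolv.conf order
    raise AllServersFailed(errors)
```

With this shape, a dead first server costs one timeout rather than a black-holed request; only when every server fails does the lookup error out.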

Version-Release number of selected component (if applicable):
# rpm -qa sssd

How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Power down the first DNS server listed in resolv.conf
2. Stop SSSD, remove the cache, and start SSSD
3. Attempt a getent or id lookup against the LDAP server

Actual results:
getent/id fails to return valid info. The Kerberos ticket is issued properly
and the SASL bind works, but the LDAP connection gets reset.

Expected results:
SSSD should pick up the next DNS server and re-try the request.

Additional info:

Contents of /etc/resolv.conf file:

# cat /etc/resolv.conf
search domain.com

The first server in the list is unreachable:

# ping
PING ( 56(84) bytes of data.
From icmp_seq=2 Destination Host Unreachable
From icmp_seq=3 Destination Host Unreachable
From icmp_seq=4 Destination Host Unreachable
--- ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4206ms
pipe 3

Dig still works fine, meaning the other DNS servers are working properly:

# dig SRV _ldap._tcp.domain.com

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6 <<>> SRV _ldap._tcp.domain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25072
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3

;_ldap._tcp.domain.com. IN        SRV

_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-3.domain.com.
_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-1.domain.com.
_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-2.domain.com.

2k8-dc-3.domain.com. 3600 IN A
2k8-dc-1.domain.com. 3600 IN A
2k8-dc-2.domain.com. 3600 IN A

;; Query time: 3 msec
;; WHEN: Thu May 23 17:14:22 2013
;; MSG SIZE  rcvd: 260
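The SRV answer above carries priority and weight fields (here 0 and 100 for all three DCs). For reference, a minimal sketch of the RFC 2782 selection algorithm that orders such records, lower priority first and weighted-random within a priority band; this illustrates the standard algorithm, not SSSD's actual failover code.

```python
import random

def order_srv(records, rng=None):
    """Order SRV records per RFC 2782.

    records: list of (priority, weight, port, target) tuples, as in the
    dig output above. Lower priority values are tried first; within one
    priority, targets are drawn randomly in proportion to their weight.
    """
    rng = rng or random.Random()
    ordered = []
    by_prio = {}
    for rec in records:
        by_prio.setdefault(rec[0], []).append(rec)
    for prio in sorted(by_prio):
        band = list(by_prio[prio])
        while band:
            weights = [r[1] for r in band]
            total = sum(weights)
            if total == 0:
                choice = rng.randrange(len(band))  # all-zero weights: uniform
            else:
                point = rng.randrange(total)       # weighted draw
                acc = 0
                for i, w in enumerate(weights):
                    acc += w
                    if point < acc:
                        choice = i
                        break
            ordered.append(band.pop(choice))
    return ordered
```

Since all three records above share priority 0 and weight 100, any of the DCs may legitimately come first; failover then means walking the rest of this ordered list when the chosen server is down.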

The sssd.conf file is set up to leverage SRV records (by way of omitting the
ldap_uri, krb5_kpasswd and krb5_server values):

# cat /etc/sssd/sssd.conf
cache_credentials = True
case_sensitive = False
config_file_version = 2
services = nss, pam
domains = DOMAIN
debug_level = 0x4000
filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd
filter_groups = root
id_provider = ldap
auth_provider = krb5
case_sensitive = False
chpass_provider = krb5
ldap_search_base = dc=domain,dc=com
ldap_schema = rfc2307
ldap_sasl_mech = GSSAPI
ldap_user_object_class = user
ldap_group_object_class = group
ldap_user_home_directory = unixHomeDirectory
ldap_user_principal = userPrincipalName
ldap_group_member = memberUid
ldap_group_name = cn
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = true
ldap_group_search_base = cn=Users,dc=domain,dc=com
ldap_sasl_authid = root/centos64.domain.com@DOMAIN.COM
entry_cache_timeout = 120
krb5_realm = DOMAIN.COM
cache_credentials = false
krb5_canonicalize = false

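Until failover across DNS servers works, one possible workaround is to pin the servers explicitly and keep SRV discovery as a fallback via SSSD's `_srv_` keyword in the failover server list. This is only a sketch using this report's placeholder hostnames; verify the exact option syntax against the sssd.conf and sssd-ldap man pages for the deployed version.

```ini
[domain/DOMAIN]
# Explicit server list first, SRV discovery as fallback.
ldap_uri = ldap://2k8-dc-1.domain.com, ldap://2k8-dc-2.domain.com, _srv_
krb5_server = 2k8-dc-1.domain.com, 2k8-dc-2.domain.com
```

With an explicit list, SSSD's own failover between LDAP servers applies even when the first DNS server (which here doubles as a DC) is unreachable, at the cost of hardcoding server names.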
When the first server is down, SSSD lookups fail:

# service sssd stop
Stopping sssd:                                             [  OK  ]
# rm -f /var/lib/sss/db/*
# service sssd start
Starting sssd:                                             [  OK  ]
# getent passwd ldapuser

When a working DNS server is moved to first in the list, SSSD lookups succeed
without even needing to restart SSSD:

# cat /etc/resolv.conf
search domain.com

# ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=128 time=0.830 ms
64 bytes from icmp_seq=2 ttl=128 time=0.266 ms
--- ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1040ms
rtt min/avg/max/mdev = 0.266/0.548/0.830/0.282 ms

# getent passwd ldapuser

Packet traces of the failed attempt and the successful attempt are attached.

In the trace:

- SSSD is started after clearing the cache
- getent passwd ldapuser is executed

Traces are filtered for the IPs of the domain controllers and the client.
The actual domain names appear in the traces; the domain names above are just
placeholders.


client -
DC1/DNS1 -
DC2/DNS2 -
DC3/DNS3 -

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.10.0

Fields changed

milestone: SSSD 1.10.0 => SSSD 1.10.1

Fields changed

owner: somebody => mzidek

Michal, please test if this issue still persists with Pavel's patches that are currently on review in the thread called "[PATCHES] fix SRV expansion".

Moving tickets that didn't make 1.10.1 to the 1.10.2 bucket.

milestone: SSSD 1.10.1 => SSSD 1.10.2

Fields changed

patch: 0 => 1

resolution: => fixed
status: new => closed

Fields changed

changelog: => N/A, just a bugfix

Metadata Update from @jhrozek:
- Issue assigned to mzidek
- Issue set to the milestone: SSSD 1.10.2

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3008

If you want to receive further updates on the issue, please navigate to the
GitHub issue and click the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.
