#1966 SSSD failover doesn't work if the first DNS server in resolv.conf is unavailable
Closed: Fixed. Opened 5 years ago by jhrozek.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 6): Bug 966757

Created attachment 752397
Traces from client and DC of failure and success

Description of problem: When using SRV record failover to integrate SSSD
with Active Directory, everything works fine as long as the first DNS server
listed in resolv.conf is alive and well.

However, if the first DNS server in resolv.conf is down, SSSD black-holes
the request.

In a packet trace, the failure shows up as a RST after the unavailable DNS
server is queried.

This is especially problematic with Active Directory, where the DNS servers
are often the DCs, which in turn are also the LDAP servers. The whole point of
failover is to survive an outage of an LDAP server, but if that LDAP server is
also the first listed DNS server, SSSD simply breaks.

The behavior is similar to attempting to use round-robin (RR) DNS for
failover, which the documentation states is not supported.

SSSD should not fail if there are other viable DNS servers in resolv.conf.
Instead, it should move on to the next DNS server and retry the request.
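The expected behavior can be sketched as a simple loop over the configured resolvers. This is an illustrative sketch only, not SSSD's actual internals (SSSD uses the c-ares library in C); `query_fn` and `QueryTimeout` are hypothetical stand-ins for a real DNS client:

```python
# Illustrative per-nameserver failover: try each resolver in
# resolv.conf order and move on when one times out or errors out.

class QueryTimeout(Exception):
    """Raised when a single nameserver does not answer in time."""

def resolve_with_failover(nameservers, question, query_fn):
    """Try each nameserver in order; return the first answer received."""
    last_error = None
    for server in nameservers:
        try:
            return query_fn(server, question)
        except QueryTimeout as err:
            last_error = err  # dead server: fall through to the next one
    raise RuntimeError(f"all nameservers failed: {last_error}")
```

The key point is that a timeout on one server is not treated as a terminal failure for the whole lookup.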


Version-Release number of selected component (if applicable):
# rpm -qa sssd
sssd-1.9.2-82.7.el6_4.x86_64


How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Power down the first DNS server listed in resolv.conf
2. stop SSSD, remove the cache and start SSSD
3. run getent or id for a user served by the LDAP server

Actual results:
getent/id fails to return valid info.
The Kerberos ticket is issued properly and the SASL bind works, but the LDAP
connection gets reset.

Expected results:
SSSD should pick up the next DNS server and re-try the request.

Additional info:

Contents of /etc/resolv.conf file:

# cat /etc/resolv.conf
search domain.com
nameserver 10.61.179.155
nameserver 10.61.179.152
nameserver 10.61.179.174
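The relevant input here is just the ordered list of `nameserver` lines; resolvers are tried in file order. A minimal parsing sketch (not glibc's full resolv.conf grammar):

```python
# Minimal resolv.conf parser: collect the IPs from "nameserver" lines,
# preserving file order, since resolvers are tried in the order listed.

def parse_nameservers(resolv_conf_text):
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers
```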

The first server in the list is unreachable:

# ping 10.61.179.155
PING 10.61.179.155 (10.61.179.155) 56(84) bytes of data.
From 10.61.179.150 icmp_seq=2 Destination Host Unreachable
From 10.61.179.150 icmp_seq=3 Destination Host Unreachable
From 10.61.179.150 icmp_seq=4 Destination Host Unreachable
^C
--- 10.61.179.155 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4206ms
pipe 3


Dig still works fine, meaning the other DNS servers are working properly:

# dig SRV _ldap._tcp.domain.com

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6 <<>> SRV _ldap._tcp.domain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25072
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3

;; QUESTION SECTION:
;_ldap._tcp.domain.com. IN        SRV

;; ANSWER SECTION:
_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-3.domain.com.
_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-1.domain.com.
_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-2.domain.com.

;; ADDITIONAL SECTION:
2k8-dc-3.domain.com. 3600 IN A    10.61.179.174
2k8-dc-1.domain.com. 3600 IN A    10.61.179.152
2k8-dc-2.domain.com. 3600 IN A    10.61.179.155

;; Query time: 3 msec
;; SERVER: 10.61.179.152#53(10.61.179.152)
;; WHEN: Thu May 23 17:14:22 2013
;; MSG SIZE  rcvd: 260
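Once the SRV answer is obtained, RFC 2782 says lower priority values are preferred, and records sharing a priority are picked with probability proportional to their weight. All three records above share priority 0 and weight 100, so any DC may be tried first. A sketch of that ordering (assuming positive weights, which holds for the records above):

```python
import random

# RFC 2782-style SRV ordering sketch: group records by priority
# (lower first), then draw from each group with weight-proportional
# probability. records: (priority, weight, port, target) tuples.

def order_srv(records, rng=random):
    ordered = []
    for prio in sorted({r[0] for r in records}):
        group = [r for r in records if r[0] == prio]
        while group:
            pick = rng.uniform(0, sum(r[1] for r in group))
            running = 0.0
            for rec in group:
                running += rec[1]
                if pick <= running:
                    ordered.append(rec)
                    group.remove(rec)
                    break
    return ordered
```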

The sssd.conf file is set up to leverage SRV records (by omitting the
ldap_uri, krb5_server, and krb5_kpasswd values):

# cat /etc/sssd/sssd.conf
[domain/default]
cache_credentials = True
case_sensitive = False
[sssd]
config_file_version = 2
services = nss, pam
domains = DOMAIN
debug_level = 0x4000
[nss]
filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd
filter_groups = root
[pam]
[domain/DOMAIN]
id_provider = ldap
auth_provider = krb5
case_sensitive = False
chpass_provider = krb5
ldap_search_base = dc=domain,dc=com
ldap_schema = rfc2307
ldap_sasl_mech = GSSAPI
ldap_user_object_class = user
ldap_group_object_class = group
ldap_user_home_directory = unixHomeDirectory
ldap_user_principal = userPrincipalName
ldap_group_member = memberUid
ldap_group_name = cn
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = true
ldap_group_search_base = cn=Users,dc=domain,dc=com
ldap_sasl_authid = root/centos64.domain.com@DOMAIN.COM
entry_cache_timeout = 120
krb5_realm = DOMAIN.COM
cache_credentials = false
krb5_canonicalize = false

When the first server is down, SSSD lookups fail:

# service sssd stop
Stopping sssd:                                             [  OK  ]
# rm -f /var/lib/sss/db/*
# service sssd start
Starting sssd:                                             [  OK  ]
# getent passwd ldapuser
#


When a working DNS server is moved to first in the list, SSSD lookups succeed
without even needing to restart SSSD:

# cat /etc/resolv.conf
search domain.com
nameserver 10.61.179.152
nameserver 10.61.179.155
nameserver 10.61.179.174

# ping 10.61.179.152
PING 10.61.179.152 (10.61.179.152) 56(84) bytes of data.
64 bytes from 10.61.179.152: icmp_seq=1 ttl=128 time=0.830 ms
64 bytes from 10.61.179.152: icmp_seq=2 ttl=128 time=0.266 ms
^C
--- 10.61.179.152 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1040ms
rtt min/avg/max/mdev = 0.266/0.548/0.830/0.282 ms


# getent passwd ldapuser
ldapuser:*:1603:513:ldapuser:/home/ldapuser:/bin/sh

Packet traces of the failed attempt and the successful attempt are attached.

In the trace:

- SSSD is started after clearing the cache
- getent passwd ldapuser is executed

The traces are filtered to the IPs of the domain controllers and the client.
The actual domain names appear in the traces; the domain names above are
placeholders.

IPs:

client - 10.61.179.150
DC1/DNS1 - 10.61.179.152
DC2/DNS2 - 10.61.179.155
DC3/DNS3 - 10.61.179.174

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.10.0
review: True => 0

Fields changed

milestone: SSSD 1.10.0 => SSSD 1.10.1

Fields changed

owner: somebody => mzidek

Michal, please test if this issue still persists with Pavel's patches that are currently on review in the thread called "[PATCHES] fix SRV expansion".

Moving tickets that didn't make 1.10.1 to the 1.10.2 bucket.

milestone: SSSD 1.10.1 => SSSD 1.10.2

Fields changed

patch: 0 => 1

resolution: => fixed
status: new => closed

Fields changed

changelog: => N/A, just a bugfix

Metadata Update from @jhrozek:
- Issue assigned to mzidek
- Issue set to the milestone: SSSD 1.10.2
