#3312 SSSD AD Failover Failure

Created 6 months ago by chrismwheeler
Modified 5 months ago

I am attempting to understand why our Red Hat Linux devices fail to correctly fail over when using SSSD authentication to AD. The logs show inaccessible servers being marked as 'working', and after a failure SSSD does not attempt to communicate with the accessible one.

My test setup, on IBM SoftLayer, involves a single Red Hat 6.x VM, authtest.testdom.local (IP 192.168.100.50), set up on a different VLAN from the AD servers, mgt-dc-001.testdom.local (192.168.200.133) and mgt-dc-002.testdom.local (192.168.200.134). Traffic is allowed or blocked by a network device, infra-vyatta.testdom.local, which allows me to simulate server failure. (Note: IPs and hostnames have been changed for security.)

I currently have logging set to maximum, and have captured log files and network traffic while all communication with mgt-dc-001.testdom.local is blocked by the Vyatta and all communication with mgt-dc-002.testdom.local is allowed.

The attached logs cover two events: the start of the sssd service at 10:53:11 and the logon attempt at 10:54:21.

What I would like to understand is the following.

Why is port 88 of mgt-dc-001 marked as 'working', when the attached network capture clearly shows that it is not responding?

{{{
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'
}}}
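Reading that trace, it looks as though the status is set to 'working' simply because the auth request finished, regardless of its outcome. A toy illustration of the pattern I appear to be seeing (the names are my own assumptions, not SSSD source; what I would expect instead is sketched further down):

{{{
#include <stdio.h>

/* Toy illustration of what the debug trace above appears to show: the port
 * status is set to 'working' when the auth request completes, without
 * consulting whether the request actually reached the KDC. Names are my own
 * assumptions, not SSSD source. */

enum port_status { PORT_NOT_WORKING, PORT_WORKING };

static enum port_status observed_auth_done(int krb_error)
{
    (void)krb_error;          /* the error from the timed-out request appears unused */
    return PORT_WORKING;      /* "Marking port 88 ... as 'working'" */
}

int main(void)
{
    /* Port 88 on mgt-dc-001 is blocked, so the request times out... */
    int timed_out = 1;

    /* ...yet the status still comes back as 'working'. */
    printf("status after timeout: %s\n",
           observed_auth_done(timed_out) == PORT_WORKING
               ? "PORT_WORKING" : "PORT_NOT_WORKING");
    return 0;
}
}}}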

Because it is not responding, Kerberos drops into offline mode rather than trying mgt-dc-002.testdom.local:

{{{
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!
}}}

It then tries to reset all services, as 'no server' is available:

{{{
(Fri Feb 17 10:54:48 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_reset_services] (0x1000): Resetting all servers in all services
}}}

Whenever it checks port 88 on mgt-dc-001 it seems to mark it as working; it never appears to check mgt-dc-002, and it keeps falling back to offline authentication, which then doesn't work.

Also, for the AD and LDAP services it seems to be marking port 0 as not working a lot, but if a port is specified it cannot resolve the name, because it attempts DNS resolution with the port still attached to the hostname! I'm not sure whether this is related.
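To illustrate what I mean: the resolver appears to be handed the whole 'hostname:port' string. A trivial example of the split I would have expected before resolution (purely illustrative, not SSSD code; the hostname and port here are made up):

{{{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *server = "mgt-dc-001.testdom.local:3268"; /* example value only */
    char host[256];
    int port = 389;                       /* example default if no port is given */

    const char *colon = strrchr(server, ':');
    size_t len = colon ? (size_t)(colon - server) : strlen(server);

    /* Resolve only the host part; keep the port number separately. */
    snprintf(host, sizeof(host), "%.*s", (int)len, server);
    if (colon)
        port = atoi(colon + 1);

    printf("resolve '%s', then connect to port %d\n", host, port);
    return 0;
}
}}}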

What we expect is for SSSD to find that port 88 on mgt-dc-001.testdom.local is not working, mark it 'not working', and then fail over to mgt-dc-002. Occasionally this seems to happen, but only very occasionally. In the network capture you can see that port 88 traffic to mgt-dc-002 is answered while port 88 traffic to mgt-dc-001 is not, that after LDAP communication on 389 to mgt-dc-001 fails it successfully connects to mgt-dc-002, and that DNS communication with mgt-dc-002 is succeeding.
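To make the expectation concrete, here is a rough sketch of the failover behaviour we are hoping for, with made-up structures and names (this is not SSSD source, just an illustration):

{{{
#include <stdio.h>
#include <stddef.h>

/* Rough sketch of the failover behaviour we expect on a KDC timeout.
 * All names here are assumptions made for illustration; this is not SSSD code. */

enum port_status { PORT_NEUTRAL, PORT_WORKING, PORT_NOT_WORKING };

struct server {
    const char *name;
    enum port_status krb_port;   /* status of port 88 on this server */
};

/* Expected: a timeout against the current server marks its port as not
 * working and the next candidate is tried; only when every candidate has
 * failed should the backend go offline. */
static struct server *expected_failover(struct server *servers, size_t count,
                                        size_t failed)
{
    servers[failed].krb_port = PORT_NOT_WORKING;

    for (size_t i = 0; i < count; i++) {
        if (servers[i].krb_port != PORT_NOT_WORKING) {
            return &servers[i];
        }
    }
    return NULL;   /* no server left: offline is reasonable only here */
}

int main(void)
{
    struct server dcs[] = {
        { "mgt-dc-001.testdom.local", PORT_NEUTRAL },
        { "mgt-dc-002.testdom.local", PORT_NEUTRAL },
    };

    /* Port 88 on mgt-dc-001 times out (blocked by the Vyatta)... */
    struct server *next = expected_failover(dcs, 2, 0);

    /* ...so we expect the next attempt to go to mgt-dc-002. */
    printf("next KDC candidate: %s\n", next ? next->name : "none (offline)");
    return 0;
}
}}}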

I think I have attached all relevant configuration and log files.

Am I doing something obviously wrong, or is the failover in this version not working as it should?

Thanks

Hmm, you might be right. I would expect this code:

{{{
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_mod_ccname] (0x4000): Save ccname [FILE:/tmp/krb5cc_1763401190_ech4R3] for user [gb103690].
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 0)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 1)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x111c640
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x111c770
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Running timer event 0x111c640 "ltdb_callback"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Destroying timer event 0x111c770 "ltdb_timeout"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Ending timer event 0x111c640 "ltdb_callback"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 1)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 0)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [check_wait_queue] (0x1000): Wait queue for user [gb103690] is empty.
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_queue_done] (0x1000): krb5_auth_queue request [0x1118f00] done.
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [be_pam_handler_callback] (0x0100): Backend returned: (1, 9, <NULL>) [Provider is Offline]
}}}

...to reset the offline status when the port is marked as 'working'.
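Roughly speaking, something like the following in the port-status path, i.e. clear the offline flag whenever a port comes back as 'working'. This is only a sketch with assumed helper names, not a patch against the actual tree:

{{{
#include <stdio.h>
#include <stdbool.h>

/* Sketch only: the kind of check I would expect when a port is marked as
 * working while the backend still believes it is offline. All names here
 * are assumptions for illustration, not the real SSSD structures. */

enum port_status { PORT_NOT_WORKING, PORT_WORKING };

struct be_ctx {
    bool offline;                 /* stand-in for the real offline flag */
};

static bool be_is_offline(struct be_ctx *ctx)  { return ctx->offline; }
static void reset_offline(struct be_ctx *ctx)  { ctx->offline = false; }

/* The port-status setter: if a port just came back as working, the offline
 * flag should not linger, otherwise the auth-done path keeps bailing out
 * with "Backend is marked offline, retry later!" even though a usable
 * server exists. */
static void set_port_status(struct be_ctx *ctx, enum port_status status)
{
    /* ... record the new port status in the failover data here ... */

    if (status == PORT_WORKING && be_is_offline(ctx)) {
        reset_offline(ctx);
    }
}

int main(void)
{
    struct be_ctx ctx = { .offline = true };

    set_port_status(&ctx, PORT_WORKING);
    printf("offline after marking port working: %s\n",
           be_is_offline(&ctx) ? "yes" : "no");
    return 0;
}
}}}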


We'll take a look once the core features for 1.15 are done, but as I said, this looks like a bug.

milestone: NEEDS_TRIAGE => SSSD 1.15.3

6 months ago

Metadata Update from @chrismwheeler:
- Issue set to the milestone: SSSD 1.15.3

5 months ago

Metadata Update from @jhrozek:
- Custom field design_review reset
- Custom field mark reset
- Custom field patch reset
- Custom field review reset
- Custom field sensitive reset
- Custom field testsupdated reset
- Issue close_status updated to: None
- Issue set to the milestone: SSSD 1.15.4 (was: SSSD 1.15.3)


defect | SSSD | 1.13.3
