#3312 SSSD AD Failover Failure

Created 8 months ago by chrismwheeler
Modified 2 days ago

I am attempting to understand why our Red Hat Linux devices fail to fail over correctly when using SSSD authentication to AD: the logs show SSSD marking an inaccessible server as 'working', and then, after the failure, never attempting to communicate with the accessible one.

My test setup, on IBM SoftLayer, involves a single Red Hat 6.x VM, authtest.testdom.local (IP 192.168.100.50), set up on a different VLAN from the AD servers, mgt-dc-001.testdom.local (192.168.200.133) and mgt-dc-002.testdom.local (192.168.200.134). Traffic is allowed or blocked by a network device, infra-vyatta.testdom.local, which allows me to simulate server failure. (Note: IPs and hostnames have been changed for security.)
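To simulate the failure of mgt-dc-001, a drop rule is applied on the Vyatta for that DC's address; it is along these lines (rule-set name, interface and direction here are illustrative, not the exact production configuration):

set firewall name BLOCK-DC1 rule 10 action drop
set firewall name BLOCK-DC1 rule 10 destination address 192.168.200.133
set interfaces ethernet eth0 firewall out name BLOCK-DC1
commit

Removing the rule set (or detaching it from the interface) restores communication with mgt-dc-001.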

I currently have logging set to maximum, and have captured log files and network traffic while all communication with mgt-dc-001.testdom.local is blocked by the Vyatta and all communication with mgt-dc-002.testdom.local is allowed.
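(For completeness: 'maximum logging' means the debug level raised to everything in each sssd.conf section, and the traffic was captured on the client. The lines below are an illustrative reconstruction, not a copy of the attached files; the interface name and capture file name are placeholders.)

[sssd]
# bitmask form; enables all debug categories
debug_level = 0xFFF0
[domain/TESTDOM.LOCAL]
debug_level = 0xFFF0

# capture traffic to/from both DCs on the client
tcpdump -i eth0 -w sssd_failover.pcap host 192.168.200.133 or host 192.168.200.134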

The attached logs cover two events: the start of the sssd service at 10:53:11 and a logon attempt at 10:54:21.

What I would like to understand is the following:

Why is port 88 of mgt-dc-001 marked as 'working', when the attached network capture clearly shows that it is not responding:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'
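(As a quick cross-check of what the capture shows, port reachability from the client can be probed with something like the following; this is just an illustrative command, the pcap in the attachment is the actual evidence:)

# TCP connect test with a 3-second timeout; expected to fail for mgt-dc-001 while it is blocked
nc -vz -w 3 mgt-dc-001.testdom.local 88
nc -vz -w 3 mgt-dc-002.testdom.local 88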

Because it is not responding, Kerberos drops into offline mode rather than trying mgt-dc-002.testdom.local:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!

It then tries to reset all services, as 'no server' is available:

(Fri Feb 17 10:54:48 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_reset_services] (0x1000): Resetting all servers in all services

Whenever it checks port 88 on mgt-dc-001 it seems to mark it as working, it never reports checking mgt-dc-002, and it keeps falling back to offline authentication, which then fails.

Also, for the AD and LDAP services it seems to mark port 0 as 'not working' a lot, but if a port is specified it cannot resolve the name, because it attempts DNS resolution with the port appended to the hostname. I'm not sure whether this is related.

What we expect is for SSSD to find that port 88 on mgt-dc-001.testdom.local is not working, mark it 'not working', and then fail over to mgt-dc-002. This does seem to happen occasionally, but only very occasionally. The network capture shows that port 88 communication is attempted to mgt-dc-002, which is answered, and to mgt-dc-001, which is not. It also shows that after LDAP communication on port 389 to mgt-dc-001 fails, SSSD successfully connects to mgt-dc-002, and that DNS communication with mgt-dc-002 is succeeding.

I think I have attached all relevant configuration and log files.
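In case it is useful while reviewing the attachment, the domain section is along these lines; this is a simplified sketch using standard SSSD AD-provider options, not a verbatim copy of the attached sssd.conf:

[domain/TESTDOM.LOCAL]
# identity and authentication both come from AD
id_provider = ad
auth_provider = ad
ad_domain = testdom.local
krb5_realm = TESTDOM.LOCAL
# both DCs listed; failover is expected to walk this list in order
ad_server = mgt-dc-001.testdom.local, mgt-dc-002.testdom.local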

Am I doing something obviously wrong, or is the failover in this version not working as it should?

Thanks

Attachments
Auth_Debug.zip - 2017-09-14 08:48:28

Hmm, you might be right. I would expect this code:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_mod_ccname] (0x4000): Save ccname [FILE:/tmp/krb5cc_1763401190_ech4R3] for user [gb103690].
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 0)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 1)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x111c640
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x111c770
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Running timer event 0x111c640 "ltdb_callback"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Destroying timer event 0x111c770 "ltdb_timeout"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Ending timer event 0x111c640 "ltdb_callback"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 1)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 0)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [check_wait_queue] (0x1000): Wait queue for user [gb103690] is empty.
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_queue_done] (0x1000): krb5_auth_queue request [0x1118f00] done.
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [be_pam_handler_callback] (0x0100): Backend returned: (1, 9, <NULL>) [Provider is Offline]

to reset the offline status when the port is marked as working.


We'll take a look once the core features for 1.15 are done, but as I said, this looks like a bug.

milestone: NEEDS_TRIAGE => SSSD 1.15.3

8 months ago

Metadata Update from @chrismwheeler:
- Issue set to the milestone: SSSD 1.15.3

7 months ago

Metadata Update from @jhrozek:
- Custom field design_review reset
- Custom field mark reset
- Custom field patch reset
- Custom field review reset
- Custom field sensitive reset
- Custom field testsupdated reset
- Issue close_status updated to: None
- Issue set to the milestone: SSSD 1.15.4 (was: SSSD 1.15.3)

Hi, I'm sorry that it took so long, but unfortunately the debug logs are not available anymore. Do you still have this issue? Could you please attach new logs?

However, from the description, it does look similar to https://pagure.io/SSSD/sssd/issue/3401

a month ago

Metadata Update from @pbrezina:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)

Files re-uploaded on request:

Auth_Debug.zip

It does indeed look similar.

I see you have a very low timeout of 1 second, which from the logs doesn't seem to be enough. Can you try setting dns_resolver_timeout to a higher value? It should also be bigger than dns_resolver_op_timeout, which was not made clear until we mentioned it in the documentation in a recent version. See: https://pagure.io/SSSD/sssd/issue/3217#comment-223033

Set them, for example, to dns_resolver_timeout = 6 and dns_resolver_op_timeout = 3.
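In sssd.conf that would look something like this (illustrative snippet; adjust the section name to your domain and restart sssd afterwards):

[domain/TESTDOM.LOCAL]
# overall time allowed to resolve a failover service; should be larger than dns_resolver_op_timeout
dns_resolver_timeout = 6
# time allowed for a single DNS query before trying the next name server
dns_resolver_op_timeout = 3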

a month ago

Metadata Update from @pbrezina:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)

We've decided to rename the 1.15.4 release to 1.16.0, but since we need to release 1.16.0 tomorrow, I'm bumping this ticket to 1.16.1


2 days ago

Metadata Update from @jhrozek:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue set to the milestone: SSSD 1.16.1 (was: SSSD 1.15.4)


Type: defect

Component: SSSD

Version: 1.13.3