#3312 SSSD AD Failover Failure
Closed: cloned-to-github 3 years ago by pbrezina. Opened 7 years ago by chrismwheeler.

I am attempting to understand why our Red Hat Linux devices fail to fail over correctly when using SSSD authentication against AD: the logs show inaccessible servers being marked as 'working', and after the failure SSSD does not attempt to communicate with the accessible one.

My test setup, on IBM SoftLayer, involves a single Red Hat 6.x VM, authtest.testdom.local (IP 192.168.100.50), set up on a different VLAN from the AD servers, mgt-dc-001.testdom.local (192.168.200.133) and mgt-dc-002.testdom.local (192.168.200.134). Traffic is allowed or blocked by a network device, infra-vyatta.testdom.local, which lets me simulate server failure. (Note: IPs and hostnames have been changed for security.)
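
For reference, the failover-relevant part of the configuration looks roughly like the sketch below. This is a reconstruction for readability only; the actual attached sssd.conf may differ (for example, it may rely on DNS SRV discovery instead of an explicit ad_server list):

{{{
[sssd]
services = nss, pam
domains = TESTDOM.LOCAL

[domain/TESTDOM.LOCAL]
id_provider = ad
auth_provider = ad
ad_domain = testdom.local
# Primary DC listed first; mgt-dc-002 is the intended failover target
ad_server = mgt-dc-001.testdom.local, mgt-dc-002.testdom.local
}}}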

I currently have logging set to maximum, and have captured log files and network traffic while all communication with the server mgt-dc-001.testdom.local is blocked by the Vyatta device and all communication with the server mgt-dc-002.testdom.local is allowed.

The attached logs show two events: the start of the sssd service at 10:53:11, and the logon attempt at 10:54:21.

What I would like to understand is the following.

Why is port 88 of mgt-dc-001 marked as working, when the attached network capture clearly shows that it is not responding:

{{{
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'
}}}

Because it is not responding, Kerberos drops into offline mode rather than trying mgt-dc-002.testdom.local:

{{{
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!
}}}

So it then tries to reset all services, as 'no server' is available:

{{{
(Fri Feb 17 10:54:48 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_reset_services] (0x1000): Resetting all servers in all services
}}}

Whenever it attempts to check port 88 on mgt-dc-001 it seems to mark it as working; it never reports checking mgt-dc-002, and it keeps falling back to offline authentication, which then doesn't work.

Also, for the AD and LDAP services it seems to do a lot of marking port 0 as not working, but if a port is specified it cannot resolve the name, because it attempts DNS resolution with the port appended to the hostname! I'm not sure whether this is related.

What we expect is for SSSD to find that port 88 on mgt-dc-001.testdom.local is not working, mark it 'not working', and then fail over to mgt-dc-002. Occasionally this does happen, but only rarely. From the network capture you can see that it attempts port 88 communication to mgt-dc-002, which is answered, and port 88 communication to mgt-dc-001, which is not. You can see that after LDAP communication on 389 to mgt-dc-001 fails, it successfully connects to mgt-dc-002. You can also see that DNS communication with mgt-dc-002 is succeeding.

I think I have attached all relevant configuration and log files.

Am I doing something obviously wrong, or is the failover in this version not working as it should?

Thanks


Hmm, you might be right. I would expect this code:

{{{
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_mod_ccname] (0x4000): Save ccname [FILE:/tmp/krb5cc_1763401190_ech4R3] for user [gb103690].
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 0)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 1)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x111c640
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x111c770
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Running timer event 0x111c640 "ltdb_callback"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Destroying timer event 0x111c770 "ltdb_timeout"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Ending timer event 0x111c640 "ltdb_callback"
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 1)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 0)
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [check_wait_queue] (0x1000): Wait queue for user [gb103690] is empty.
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_queue_done] (0x1000): krb5_auth_queue request [0x1118f00] done.
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [be_pam_handler_callback] (0x0100): Backend returned: (1, 9, <NULL>) [Provider is Offline]
}}}

to reset the offline status when the port is marked as working.
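
To make that expectation concrete, here is a tiny self-contained sketch (a toy model for illustration, not SSSD's actual data structures or functions) of the two pieces of state involved:

{{{
/* Toy model of the interaction discussed above, not SSSD code.
 * Two flags exist: the per-port failover status and a global
 * "backend offline" flag. The logs suggest the port is set back to
 * WORKING without the offline flag being cleared, so every request
 * still fails with "Backend is marked offline, retry later!". */
#include <stdbool.h>
#include <stdio.h>

enum port_status { PORT_NOT_WORKING, PORT_NEUTRAL, PORT_WORKING };

struct backend {
    enum port_status kdc_port;  /* failover status of port 88     */
    bool offline;               /* global "backend offline" flag  */
};

/* What the logs suggest happens today: only the port flag changes. */
static void mark_port_working_only(struct backend *be)
{
    be->kdc_port = PORT_WORKING;
}

/* What one would expect: marking a port as working should also take
 * the backend out of offline mode so the next request goes online. */
static void mark_port_working_and_reset(struct backend *be)
{
    be->kdc_port = PORT_WORKING;
    be->offline = false;
}

int main(void)
{
    struct backend be = { PORT_NOT_WORKING, true };

    mark_port_working_only(&be);
    printf("port-only update : offline=%d\n", be.offline);  /* still 1 */

    mark_port_working_and_reset(&be);
    printf("port+reset update: offline=%d\n", be.offline);  /* now 0 */
    return 0;
}
}}}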


We'll take a look once the core features for 1.15 are done, but as I said, this looks like a bug.

milestone: NEEDS_TRIAGE => SSSD 1.15.3

Metadata Update from @chrismwheeler:
- Issue set to the milestone: SSSD 1.15.3

7 years ago

Metadata Update from @jhrozek:
- Custom field design_review reset
- Custom field mark reset
- Custom field patch reset
- Custom field review reset
- Custom field sensitive reset
- Custom field testsupdated reset
- Issue close_status updated to: None
- Issue set to the milestone: SSSD 1.15.4 (was: SSSD 1.15.3)

7 years ago

Hi, I'm sorry that it took so long, but unfortunately the debug logs are not available anymore. Do you still have this issue? Could you please attach new logs?

However, from the description, it does look similar to https://pagure.io/SSSD/sssd/issue/3401

Metadata Update from @pbrezina:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)

6 years ago

I see you have a very low timeout of 1 second, which from the logs doesn't seem to be enough. Can you try setting dns_resolver_timeout to a higher value? It should also be larger than dns_resolver_op_timeout, which was not made clear until we mentioned it in the documentation in a recent version. See: https://pagure.io/SSSD/sssd/issue/3217#comment-223033

Set them, for example, to dns_resolver_timeout = 6 and dns_resolver_op_timeout = 3.
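
For example, assuming these options live in the [domain/...] section of sssd.conf (adjust the section name to your domain):

{{{
[domain/TESTDOM.LOCAL]
# Overall time allowed to resolve a server for a service; should be
# larger than dns_resolver_op_timeout.
dns_resolver_timeout = 6
# Time allowed for a single DNS query against one DNS server.
dns_resolver_op_timeout = 3
}}}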

Metadata Update from @pbrezina:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)

6 years ago

We've decided to rename the 1.15.4 release to 1.16.0, but since we need to release 1.16.0 tomorrow, I'm bumping this ticket to 1.16.1

Metadata Update from @jhrozek:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)

6 years ago

Metadata Update from @jhrozek:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue set to the milestone: SSSD 1.16.1 (was: SSSD 1.15.4)

6 years ago

I see you have a very low timeout of 1 second, which from the logs doesn't seem to be enough. Can you try setting dns_resolver_timeout to a higher value? It should also be larger than dns_resolver_op_timeout, which was not made clear until we mentioned it in the documentation in a recent version. See: https://pagure.io/SSSD/sssd/issue/3217#comment-223033
Set them, for example, to dns_resolver_timeout = 6 and dns_resolver_op_timeout = 3.

Unfortunately this doesn't resolve the Kerberos failover issues.

Still seeing the port marked as working when this is not true:
{{{
(Tue Nov 7 10:03:00 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1093
(Tue Nov 7 10:03:00 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 0 of server 'mgt-dc-001.testdom.local' as 'working'
}}}

Which leads to failure without trying the backup server:
{{{
(Tue Nov 7 10:03:00 2017) [sssd[be[SANDBOX.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!
}}}

Metadata Update from @jhrozek:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue tagged with: postpone-to-2-0

6 years ago

Metadata Update from @jhrozek:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue untagged with: postpone-to-2-0
- Issue set to the milestone: SSSD 2.0 (was: SSSD 1.16.1)

6 years ago

Metadata Update from @jhrozek:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue set to the milestone: SSSD 2.1 (was: SSSD 2.0)

5 years ago

Metadata Update from @jhrozek:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue set to the milestone: SSSD 2.2 (was: SSSD 2.1)

5 years ago

Metadata Update from @jhrozek:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue set to the milestone: SSSD 2.3 (was: SSSD 2.2)

4 years ago

Metadata Update from @thalman:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue tagged with: Canditate to close

4 years ago

I'm sorry that this took such a long time. Do you still see this issue?

Metadata Update from @pbrezina:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)
- Issue untagged with: Canditate to close

4 years ago

Fortunately our AD servers haven't gone down, so we haven't experienced it.

Is there reason to believe the issue has been fixed? I could set up a test environment to simulate it if required.

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/4345

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @pbrezina:
- Issue close_status updated to: cloned-to-github
- Issue status updated to: Closed (was: Open)

3 years ago
