#3131 DNS resolver in sssd is buggy
Closed: Fixed. Opened 7 years ago by ondrejv2.

The DNS resolver in sssd seems to be buggy, causing sssd to occasionally go offline (usually about once every 2 hours, sometimes less frequently, but there is a pattern in the occurrences).

Here is the log file:

(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5)
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0040): Failed to find a server after 10 attempts
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_dom_offline] (0x1000): Marking back end offline
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_offline] (0x2000): Going offline!
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_offline] (0x2000): Enable check_if_online_ptask.
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_ptask_enable] (0x0080): Task [Check if online (periodic)]: already enabled
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [20011]
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [child_handler_setup] (0x2000): Signal handler set up for pid [20011]
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 0x24b5e60
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching.
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [sbus_message_handler] (0x2000): Received SBUS method org.freedesktop.sssd.dataprovider.getAccountInfo on path /org/freedesktop/sssd/dataprovider
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [sbus_get_sender_id_send] (0x2000): Not a sysbus message, quit
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_get_account_info] (0x0200): Got request for [0x1001][FAST BE_REQ_USER][1][idnumber=14001]
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_get_account_info] (0x0100): Request processed. Returned 1,11,Fast reply - offline
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_req_set_domain] (0x0400): Changing request domain from [default] to [default]
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [write_pipe_handler] (0x0400): All data has been sent!
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolv_getsrv_done] (0x1000): Using TTL [600]
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [request_watch_destructor] (0x0400): Deleting request watch
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_discover_srv_done] (0x0400): Got answer. Processing...
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_discover_srv_done] (0x0400): Got 11 servers
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [ad_srv_plugin_servers_done] (0x0400): Got 2 primary and 11 backup servers
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcduba.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcdub1.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcphil1.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dccork2.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcwro2.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcwro1.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcpra2.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcwro3.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcpra.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dccork1.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dclis2.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcduba.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcdub1.dublin.ad.s3group.com:389' for service 'AD' is already present
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_done] (0x0400): SRV lookup did not return any new server.
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'AD' as 'not resolved'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (SRV lookup meta-server), resolver returned (1432158231)
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'dcdub1.dublin.ad.s3group.com' is 'name resolved'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_send] (0x0200): The status of SRV lookup is not resolved
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5)
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'dcdub1.dublin.ad.s3group.com' is 'name resolved'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_send] (0x0200): The status of SRV lookup is not resolved
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5)
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'dcdub1.dublin.ad.s3group.com' is 'name resolved'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_send] (0x0200): The status of SRV lookup is not resolved
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5)
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'dcdub1.dublin.ad.s3group.com' is 'name resolved'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_send] (0x0200): The status of SRV lookup is not resolved
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working'
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5)
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0040): Failed to find a server after 10 attempts
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_dom_offline] (0x1000): Marking back end offline
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_offline] (0x2000): Going offline!

I am sure the DNS servers are working fine. This behavior is quite painful, as it causes applications running on the server to malfunction.
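
(For reference, the records from the log resolve fine when queried by hand; something along these lines, using the SRV and host names from the log above, returns answers without any problem:

    dig +short SRV _ldap._tcp.dublin.ad.s3group.com
    dig +short A dcdub1.dublin.ad.s3group.com

so plain DNS lookups of the same names succeed outside of sssd.)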


Fields changed

owner: somebody => pbrezina
status: new => assigned

Hi, can you send me your DNS structure and SRV records, please? I'm trying to reproduce it, but without any luck so far.
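
(In case it helps, the SRV records and a full zone dump can be pulled with dig, e.g.:

    dig SRV _ldap._tcp.dublin.ad.s3group.com
    dig AXFR dublin.ad.s3group.com @your-dns-server

The record name here is only a guess based on the log above, and the AXFR zone transfer works only if the server permits it from your host.)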

Hi, do you want me to send a complete zone dump? If so, can we keep this case private? The data is quite sensitive; we do not want to share it with anyone else...

Hi, yes, if possible a complete dump would be best. AFAIK Trac doesn't support private tickets. Feel free to attach it to the customer case; I can access it there. Thanks.

OK, the zone dump has been attached to Case #01684879.

Thank you. Can you also clear the logs, set the full debug level (0xfff0), and reproduce, please?
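
(A minimal sketch of the change in /etc/sssd/sssd.conf, assuming the [domain/default] section name that matches the logs:

    [domain/default]
    debug_level = 0xfff0

and then restart sssd so it takes effect.)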

I have an idea of what might be happening, but I'm still failing to create the conditions that would trigger this error code. The logs attached to the sos report are missing a few important messages. Thank you.

I attached the full log to the support case above.

Thank you. The problem seems to occur when multiple failover requests happen to run in parallel. Could you please try the following scratch build? It's for the RHEL 6.7 version; if you'd like me to build another one, feel free to ask.

https://pbrezina.fedorapeople.org/scratch/failover/
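
(The scratch RPMs can usually be installed over the existing packages with something like:

    yum --nogpgcheck localinstall sssd-*.rpm

The package glob is an assumption; adjust it to the files actually present in the directory.)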

I am currently running sssd 1.13.3. Does your build have all the functionality I am using? I am talking mostly about the AD provider for the automounter.

It would be better to add the explicit rpm -q sssd output, especially if you're running RHEL or CentOS, because there the version number does not match upstream exactly.
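
That is, something like:

    rpm -q sssd sssd-client

which prints the exact downstream package version and release.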

Please build one for me based on the 1.13.3 branch, as this one lacks autofs & AD support.
Thanks.

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.13.5

Tried it; still buggy, but it's a different bug now:

(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [resolve_srv_done] (0x0020): BUG: state->meta->next is NULL
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (SRV lookup meta-server), resolver returned (1432158209)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5)
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0040): Failed to find a server after 10 attempts
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.

Hi, try these packages:

https://pbrezina.fedorapeople.org/scratch/failover2/

I reproduced it with a code modification to see what exactly is happening there, and this build works for me. Let's see if it works in your environment. Thanks.

Thanks. Could you please at least change (increase) the minor version number? These packages have the same version as the previous build, and I want to make sure I update it correctly.

Trying... so far, so good. Will give it some more time.
What was the problem?

Quick note: the build I was given to test (namely the sssd-client package) is missing the 32-bit NSS libraries. These seem to normally be included in the 64-bit build as well.
Just a heads-up.
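
(A quick way to check, assuming the standard RHEL 6 locations for the 32-bit client bits:

    rpm -q sssd-client.i686
    ls -l /lib/libnss_sss.so.2

Both should show the 32-bit pieces when they are present.)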

resolution: => fixed
status: assigned => closed

Metadata Update from @ondrejv2:
- Issue assigned to pbrezina
- Issue set to the milestone: SSSD 1.13.5

7 years ago

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in SSSD's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/4164

If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.

