Learn more about these different git repos.
Other Git URLs
DNS resolver in sssd seems to be buggy causing sssd to occasionally (usually 1x per 2 hours, sometimes less frequently, but there is pattern in the occurrences) go offline.
Here is the log file:
(Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5) (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0040): Failed to find a server after 10 attempts (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5 (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_dom_offline] (0x1000): Marking back end offline (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_offline] (0x2000): Going offline! (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_offline] (0x2000): Enable check_if_online_ptask. (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_ptask_enable] (0x0080): Task [Check if online (periodic)]: already enabled (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks. (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [20011] (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [child_handler_setup] (0x2000): Signal handler set up for pid [20011] (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 0x24b5e60 (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [sbus_message_handler] (0x2000): Received SBUS method org.freedesktop.sssd.dataprovider.getAccountInfo on path /org/freedesktop/sssd/dataprovider (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [sbus_get_sender_id_send] (0x2000): Not a sysbus message, quit (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_get_account_info] (0x0200): Got request for [0x1001][FAST BE_REQ_USER][1][idnumber=14001] (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_get_account_info] (0x0100): Request processed. Returned 1,11,Fast reply - offline (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_req_set_domain] (0x0400): Changing request domain from [default] to [default] (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [write_pipe_handler] (0x0400): All data has been sent! (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolv_getsrv_done] (0x1000): Using TTL [600] (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [request_watch_destructor] (0x0400): Deleting request watch (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_discover_srv_done] (0x0400): Got answer. Processing... (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_discover_srv_done] (0x0400): Got 11 servers (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [ad_srv_plugin_servers_done] (0x0400): Got 2 primary and 11 backup servers (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcduba.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcdub1.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcphil1.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dccork2.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcwro2.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcwro1.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcpra2.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcwro3.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcpra.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dccork1.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dclis2.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcduba.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_add_server_to_list] (0x0400): Server 'dcdub1.dublin.ad.s3group.com:389' for service 'AD' is already present (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_done] (0x0400): SRV lookup did not return any new server. (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'AD' as 'not resolved' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (SRV lookup meta-server), resolver returned (1432158231) (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Trying with the next one! (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'dcdub1.dublin.ad.s3group.com' is 'name resolved' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_send] (0x0200): The status of SRV lookup is not resolved (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5) (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Trying with the next one! (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'dcdub1.dublin.ad.s3group.com' is 'name resolved' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_send] (0x0200): The status of SRV lookup is not resolved (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5) (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Trying with the next one! (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'dcdub1.dublin.ad.s3group.com' is 'name resolved' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_send] (0x0200): The status of SRV lookup is not resolved (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5) (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Trying with the next one! (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'dcdub1.dublin.ad.s3group.com' is 'name resolved' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [resolve_srv_send] (0x0200): The status of SRV lookup is not resolved (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working' (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (dcdub1.dublin.ad.s3group.com), resolver returned (5) (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0040): Failed to find a server after 10 attempts (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5 (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_dom_offline] (0x1000): Marking back end offline (Wed Aug 10 02:22:24 2016) [sssd[be[default]]] [be_mark_offline] (0x2000): Going offline!
I am sure DNS servers are working fine. This behavior is quite painful as it causes malfunction of applications running on the server.
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1367435 (Red Hat Enterprise Linux 6)
rhbz: => [https://bugzilla.redhat.com/show_bug.cgi?id=1367435 1367435]
Fields changed
owner: somebody => pbrezina status: new => assigned
Hi, can you send me your DNS structure and SRV records please? I'm trying to reproduce but without any luck so far.
Hi, do you want me to send a complete zone dump? If so, can we make this case public? Quite private, we do not want to share with anyone else...
Hi, yes, if possible, complete dump would be best. AFAIK trac doesn't support private tickets. Feel free to attach it to the customer case, I can access it there. Thanks.
Ok, zone dump has been attached to Case #01684879
Thank you. Can you also clear logs, set full debug level 0xfff0 and reproduce please?
I have an idea of what might be happening, but I'm still failing to create conditions that would trigger this error code. The logs attached to sos report are missing few important messages. Thank you.
Attached the full log to the support case above.
Thank you. The problem seems to be when multiple failover requests happened to run in parallel. Could you please try the following scratch build please? It's for RHEL6.7 version, if you'd like me to build another one, feel free to ask.
https://pbrezina.fedorapeople.org/scratch/failover/
I am currently running sssd ver 13.3. Does your build have all functionality I am using? I am talking mostly about the AD provider for automounter?
it would be better to add explicit rpm -q sssd, especially if you're running RHEL or CentOS, because there the version number does not match upstream exactly
Please build for me one based on 13.3 branch as this one lacks autofs&AD support. Thanks.
milestone: NEEDS_TRIAGE => SSSD 1.13.5
I uploaded new packages. Please, try these: https://pbrezina.fedorapeople.org/scratch/failover/
Tried, still buggy - different bug now:
(Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [resolve_srv_done] (0x0020): BUG: state->meta->next is NULL (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (SRV lookup meta-server), resolver returned (1432158209) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (myserver1.domain.com), resolver returned (5) (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_resolve_server_process] (0x0040): Failed to find a server after 10 attempts (Thu Aug 25 14:51:54 2016) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
Hi, try these packages:
https://pbrezina.fedorapeople.org/scratch/failover2/
I reproduced it with code modification to see what's exactly happening there and this build works for me. Let us see if it works in your environment. Thanks.
Thanks, could you please at least change (increase) the minor version number - these packages have the same version as the previous build - I want to make sure I update it correctly.
Ok, I increased minor number: https://pbrezina.fedorapeople.org/scratch/failover2/
Trying... so far, so good. Will give it some more time. What was the problem?
Great! Thanks.
See: https://fedorapeople.org/cgit/pbrezina/public_git/sssd.git/commit/?h=failover&id=11b6d2b7623208c7a3e148f4128e9f49c35a6ce1
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1332309 (Red Hat Enterprise Linux 7)
rhbz: [https://bugzilla.redhat.com/show_bug.cgi?id=1367435 1367435] => [https://bugzilla.redhat.com/show_bug.cgi?id=1367435 1367435], [https://bugzilla.redhat.com/show_bug.cgi?id=1332309 1332309]
Quick note: The build I was given (namely the sssd-client package) to test is missing the 32-bit NSS libraries. These seems to be normally included in the 64bit build as well. Just a heads up.
resolution: => fixed status: assigned => closed
Metadata Update from @ondrejv2: - Issue assigned to pbrezina - Issue set to the milestone: SSSD 1.13.5
SSSD is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in SSSD's github repository.
This issue has been cloned to Github and is available here: - https://github.com/SSSD/sssd/issues/4164
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Log in to comment on this ticket.