#10233 openQA worker hosts getting stuck with high load, unkillable `ldap_child` processes
Closed: Fixed 2 years ago by kevin. Opened 2 years ago by adamwill.

Describe what you would like us to do:


Re-deploy openqa-x86-worker04.iad2.fedoraproject.org . Just burn it down and start over. It has developed an odd problem where shortly after boot it gets kinda stuck, I think due to a flood of /usr/libexec/sssd/ldap_child processes. If you have an active console you can do simple stuff like ps and ls but anything more complicated fails, and you can't start a new ssh connection, and all the openQA jobs it's running get stuck. As worker hosts are totally disposable it's probably easier to just blow it away and re-deploy than it is to try and debug the problem. Especially since it just hung in the middle of a dnf update I was running to see if it'd help.

When do you need this to be done by? (YYYY/MM/DD)


ASAP. It's the only x86_64 worker host for the lab openQA instance, so until it's working properly, no x86_64 tests are running.


Metadata Update from @mobrien:
- Issue assigned to mobrien

2 years ago

Metadata Update from @zlopez:
- Issue tagged with: low-trouble, medium-gain, ops

2 years ago

Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

2 years ago

Sorry for the delay on this. It got into a funny state where 2 Operating systems were installed and trying to take control of the drives(my fault).

The server is back up and running on F34. I ran the ansible playbook openqa-workers and it failed running the handlers at the end on the restart network handler.

I ran it once more to be safe and it completed but it did not run the handlers and a task was ignored due to the network.service being down.

I will hand it back over to you from here and leave the ticket open until you are happy to close. Reach out if you need any further help

so, hmm, it looks like it gets stuck for a long time in the linux-system-roles.network role. Which, well, the fact that it gets stuck isn't surprising but the fact that it runs at all kinda is. I thought it wasn't supposed to yet, as we haven't converted the openQA workers to that format yet. See comments on https://pagure.io/fedora-infrastructure/issue/9695 . This could be part of the problem. roles/base/meta/main.yml says:

    - { role: linux-system-roles.network, when: network_connections is defined, tags: ['linux-system-roles.network'] }

and AFAICS we do not set network_connections on this host explicitly anywhere, but the role still runs, which seems strange.

I guess I could use this as a chance to try and make system roles networking work, but I'd still like to know why we're getting stuck here. @kevin , any thoughts?

There's also still something else odd going on, as several steps in the playbooks that should be quite fast take multiple minutes. I can't ssh into the box directly so I can't poke around and see what the issue is.

So, it looks like unfortunately the same problem we were having has persisted across the re-deployment :(

When the system first boots it works fine and runs tests. Some time after boot, though, it seems to get into a sort of 'stuck' state. The system load is astronomical (in the hundreds), qemu processes are running but in the openQA web UI the jobs don't show (and will eventually die after a two hour timeout), and there are hundreds of /usr/libexec/sssd/ldap_child --debug-microseconds=0 --debug-timestamps=1 --debug-fd=32 --debug-level=0x0070 processes shown in the ps output, none of which seems to be killable with SIGTERM or SIGKILL. the number of these process increases gradually (it's gone from 309 to 361 in the last fifteen minutes or so) and the system gets more and more 'stuck' (you can't launch complex processes from an existing shell, and you can't ssh in anew) until a side-channel reboot is needed, at which point the same cycle starts over.

I've no idea what's causing this or how to stop it, though :/

ok. I updated the drac and bios firmware on the machine, updated it to the 5.14.x kernel in updates-testing and then added in the 4 spare drives.

I guess lets see if it's better now? Should be done reshaping the raid in about 3 hours or so.

well, none of those things seems particularly likely to solve the problem, but hey, will keep my digits crossed. :D

Alas, it's already hosed up again...

I was watching the raid reshape and it's getting slower and slower.

  [==>..................]  reshape = 10.9% (63805440/583879680) finish=132118.3min speed=65K/sec

I'm not sure what to try next.

I've rebooted it and did a 'systemctl stop openqa*' lets see if it's at all stable until it finishes rebuilding the raid.

the sssd processes shouldn't have anything to do with openQA, btw. I'm assuming they're do with freeipa client stuff. but I don't see any way in which this box is configured differently from any of the others, so no idea why it's only a problem on this box.

ok, the raid rebuild finished, I got everything resized. I tweaked some bios options...

and so far it seems more stable. Load is in the 20s... but I guess we will see.

sadly, I think it's stuck again :( interesting that it didn't get stuck with the workers not running, though. Maybe we could cut its worker count a bit and see if it's happier that way.

I rebooted it and cut it from 30 to 20 workers overnight, but it's stuck again this morning :/ I do wonder if there's a hardware issue. I'll maybe try rebooting it again, stopping the workers, and running smartctl?

so, it's still happening, and we spotted it's actually also happening on some other worker hosts, at least x86-worker02 and I think p09-worker02.

x86-worker02 is just getting into the cycle of increasing loads and stuck ldap_child processes right now. I noticed that /var/log/sssd/sssd_fedoraproject.org.log seems to show it trying something over and over and constantly failing...every 40 seconds or so another failure shows up in the log. A single cycle seems to look like this:

(2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0020): [RID#55] No available servers for service 'IPA'
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [_be_fo_set_port_status] (0x8000): [RID#55] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 2125
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 389 of server 'ipa03.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 389 of duplicate server 'ipa03.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_handle_release] (0x2000): [RID#55] Trace: sh[0x555a490e4fd0], connected[1], ops[(nil)], ldap[0x555a49148c80], destructor_lock[0], release_memory[0]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [remove_connection_callback] (0x4000): [RID#55] Successfully removed connection callback.
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [dp_req_done] (0x0400): [RID#55] DP Request [Online Check #55]: Request handler finished [0]: Success
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [_dp_req_recv] (0x0400): [RID#55] DP Request [Online Check #55]: Receiving request data.
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [dp_req_destructor] (0x0400): [RID#55] DP Request [Online Check #55]: Request removed.
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [dp_req_destructor] (0x0400): [RID#55] Number of active DP request: 0
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_check_online_done] (0x0400): [RID#55] Error during online check [0]: Success
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_reset_services] (0x1000): [RID#55] Resetting all servers in all services
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_srv_data_status] (0x0100): [RID#55] Marking SRV lookup of service 'IPA' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa03.iad2.fedoraproject.org' as 'name not resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 389 of server 'ipa03.iad2.fedoraproject.org' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 389 of duplicate server 'ipa03.iad2.fedoraproject.org' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_srv_data_status] (0x0100): [RID#55] Marking SRV lookup of service 'IPA' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa02.iad2.fedoraproject.org' as 'name not resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 389 of server 'ipa02.iad2.fedoraproject.org' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 389 of duplicate server 'ipa02.iad2.fedoraproject.org' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_srv_data_status] (0x0100): [RID#55] Marking SRV lookup of service 'IPA' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa01.iad2.fedoraproject.org' as 'name not resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 389 of server 'ipa01.iad2.fedoraproject.org' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 389 of duplicate server 'ipa01.iad2.fedoraproject.org' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa01.iad2.fedoraproject.org' as 'name not resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 0 of server 'ipa01.iad2.fedoraproject.org' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 0 of duplicate server 'ipa01.iad2.fedoraproject.org' as 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [reactivate_subdoms] (0x1000): [RID#55] Resetting all subdomains
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sss_domain_get_state] (0x1000): [RID#55] Domain fedoraproject.org is Active
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_disable] (0x0400): [RID#55] Task [Check if online (periodic)]: disabling task
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_run_online_cb] (0x0080): [RID#55] Going online. Running callbacks.
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_online_cb] (0x0400): [RID#55] Back end is online
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_enable] (0x0400): [RID#55] Task [Subdomains Refresh]: enabling task
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_schedule] (0x0400): [RID#55] Task [Subdomains Refresh]: scheduling task 0 seconds from now [1632982735]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_execute] (0x0400): [RID#55] Task [Subdomains Refresh]: executing task, timeout 14400 seconds
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_id_op_connect_step] (0x4000): [RID#55] beginning to connect
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#55] Trying to resolve service 'IPA'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa03.iad2.fedoraproject.org' is 'name not resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa03.iad2.fedoraproject.org' is 'neutral'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_resolve_service_activate_timeout] (0x2000): [RID#55] Resolve timeout [dns_resolver_timeout] set to 6 seconds
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolve_srv_send] (0x0200): [RID#55] The status of SRV lookup is neutral
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [collapse_srv_lookup] (0x0100): [RID#55] Need to refresh SRV lookup for domain fedoraproject.org
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_discover_srv_next_domain] (0x0400): [RID#55] SRV resolution of service 'ldap'. Will use DNS discovery domain 'fedoraproject.org'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_getsrv_send] (0x0100): [RID#55] Trying to resolve SRV record of '_ldap._tcp.fedoraproject.org'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [schedule_request_timeout] (0x2000): [RID#55] Scheduling a timeout of 3 seconds
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [schedule_timeout_watcher] (0x2000): [RID#55] Scheduling DNS timeout watcher
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [unschedule_timeout_watcher] (0x4000): [RID#55] Unscheduling DNS timeout watcher
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_getsrv_done] (0x1000): [RID#55] Using TTL [300]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [request_watch_destructor] (0x0400): [RID#55] Deleting request watch
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_discover_srv_done] (0x0400): [RID#55] Got answer. Processing...
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_discover_srv_done] (0x0400): [RID#55] Got 3 servers
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_add_server_to_list] (0x0400): [RID#55] Inserted primary server 'ipa01.iad2.fedoraproject.org:389' to service 'IPA'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_add_server_to_list] (0x0400): [RID#55] Inserted primary server 'ipa02.iad2.fedoraproject.org:389' to service 'IPA'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_add_server_to_list] (0x0400): [RID#55] Inserted primary server 'ipa03.iad2.fedoraproject.org:389' to service 'IPA'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_srv_data_status] (0x0100): [RID#55] Marking SRV lookup of service 'IPA' as 'resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name not resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_is_address] (0x4000): [RID#55] [ipa01.iad2.fedoraproject.org] does not look like an IP address
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying files
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_gethostbyname_files_send] (0x0100): [RID#55] Trying to resolve A record of 'ipa01.iad2.fedoraproject.org' in files
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa01.iad2.fedoraproject.org' as 'resolving name'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying files
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_gethostbyname_files_send] (0x0100): [RID#55] Trying to resolve AAAA record of 'ipa01.iad2.fedoraproject.org' in files
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_gethostbyname_next] (0x0200): [RID#55] No more address families to retry
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying DNS
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_gethostbyname_dns_query] (0x0100): [RID#55] Trying to resolve A record of 'ipa01.iad2.fedoraproject.org' in DNS
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [schedule_request_timeout] (0x2000): [RID#55] Scheduling a timeout of 3 seconds
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [schedule_timeout_watcher] (0x2000): [RID#55] Scheduling DNS timeout watcher
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [unschedule_timeout_watcher] (0x4000): [RID#55] Unscheduling DNS timeout watcher
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolv_gethostbyname_dns_parse] (0x1000): [RID#55] Parsing an A reply
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [request_watch_destructor] (0x0400): [RID#55] Deleting request watch
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa01.iad2.fedoraproject.org' as 'name resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_resolve_server_process] (0x1000): [RID#55] Saving the first resolved server
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_resolve_server_process] (0x0200): [RID#55] Found address for server ipa01.iad2.fedoraproject.org: [10.3.163.54] TTL 300
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [ipa_resolve_callback] (0x0400): [RID#55] Constructed uri 'ldap://ipa01.iad2.fedoraproject.org'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sss_krb5_realm_has_proxy] (0x4000): [RID#55] Found KDC Proxy indicator [https://] in [https://id.fedoraproject.org/KdcProxy].
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [write_krb5info_file_contents] (0x0100): [RID#55] KDC Proxy available for realm [FEDORAPROJECT.ORG], no kdcinfo file created.
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sssd_async_socket_init_send] (0x4000): [RID#55] Using file descriptor [18] for the connection.
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sssd_async_socket_init_send] (0x0400): [RID#55] Setting 6 seconds timeout [ldap_network_timeout] for connecting
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_ldap_connect_callback_add] (0x4000): [RID#55] New connection to [ldap://ipa01.iad2.fedoraproject.org:389/??base] with fd [18]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_rootdse_send] (0x4000): [RID#55] Getting rootdse
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_print_server] (0x2000): [RID#55] Searching 10.3.163.54:389
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x0400): [RID#55] calling ldap_search_ext with [(objectclass=*)][].
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [*]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [altServer]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [namingContexts]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [supportedControl]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [supportedExtension]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [supportedFeatures]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [supportedLDAPVersion]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [supportedSASLMechanisms]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [domainControllerFunctionality]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [defaultNamingContext]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [lastUSN]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x1000): [RID#55] Requesting attrs: [highestCommittedUSN]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_ext_step] (0x2000): [RID#55] ldap_search_ext called, msgid = 1
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_op_add] (0x2000): [RID#55] New operation 1 timeout 6
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_process_result] (0x2000): Trace: sh[0x555a490fc170], connected[1], ops[0x555a4910c6b0], ldap[0x555a490f5f60]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_process_message] (0x4000): [RID#55] Message type: [LDAP_RES_SEARCH_ENTRY]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_entry] (0x1000): [RID#55] OriginalDN: [].
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [objectClass]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [vendorName]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [vendorVersion]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [dataversion]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [netscapemdsuffix]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [changeLog]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [firstchangenumber]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [lastchangenumber]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [ipatopologypluginversion]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [ipatopologyismanaged]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [ipaDomainLevel]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [namingContexts]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [supportedControl]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [supportedExtension]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [supportedFeatures]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [supportedLDAPVersion]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [supportedSASLMechanisms]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [defaultNamingContext]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_parse_range] (0x2000): [RID#55] No sub-attributes for [lastUSN]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_process_result] (0x2000): Trace: sh[0x555a490fc170], connected[1], ops[0x555a4910c6b0], ldap[0x555a490f5f60]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_process_message] (0x4000): [RID#55] Message type: [LDAP_RES_SEARCH_RESULT]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_generic_op_finished] (0x0400): [RID#55] Search result: Success(0), no errmsg set
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_op_destructor] (0x2000): [RID#55] Operation 1 finished
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_rootdse_done] (0x2000): [RID#55] Got rootdse
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_rootdse_done] (0x2000): [RID#55] Skipping auto-detection of match rule
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_get_server_opts_from_rootdse] (0x4000): [RID#55] USN value: 31834960 (int: 31834960)
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_kinit_send] (0x0400): [RID#55] Attempting kinit (default, host/openqa-x86-worker02.iad2.fedoraproject.org, FEDORAPROJECT.ORG, 86400)
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_kinit_next_kdc] (0x1000): [RID#55] Resolving next KDC for service IPA
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#55] Trying to resolve service 'IPA'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [fo_resolve_service_activate_timeout] (0x2000): [RID#55] Resolve timeout [dns_resolver_timeout] set to 6 seconds
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [resolve_srv_send] (0x0200): [RID#55] The status of SRV lookup is resolved
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_resolve_server_process] (0x1000): [RID#55] Saving the first resolved server
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_resolve_server_process] (0x0200): [RID#55] Found address for server ipa01.iad2.fedoraproject.org: [10.3.163.54] TTL 300
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_kinit_kdc_resolved] (0x1000): [RID#55] KDC resolved, attempting to get TGT...
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [create_tgt_req_send_buffer] (0x0400): [RID#55] buffer size: 88
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Setting up signal handler up for pid [215160]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Signal handler set up for pid [215160]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_tgt_child_timeout] (0x0400): [RID#55] Setting 8 seconds timeout for TGT child
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_process_result] (0x2000): Trace: sh[0x555a490fc170], connected[1], ops[(nil)], ldap[0x555a490f5f60]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [write_pipe_handler] (0x0400): [RID#55] All data has been sent!
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_online_cb] (0x0400): [RID#55] Back end is online
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_enable] (0x0400): [RID#55] Task [SUDO Smart Refresh]: enabling task
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_schedule] (0x0400): [RID#55] Task [SUDO Smart Refresh]: scheduling task 900 seconds from now [1632983635]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_online_cb] (0x0400): [RID#55] Back end is online
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_enable] (0x0400): [RID#55] Task [SUDO Full Refresh]: enabling task
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_schedule] (0x0400): [RID#55] Task [SUDO Full Refresh]: scheduling task 0 seconds from now [1632982735]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_execute] (0x0400): [RID#55] Task [SUDO Full Refresh]: executing task, timeout 21600 seconds
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [ipa_sudo_full_refresh_send] (0x0400): [RID#55] Issuing a full refresh of sudo rules
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_id_op_connect_step] (0x4000): [RID#55] waiting for connection to complete
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [delayed_online_authentication_callback] (0x0200): [RID#55] Backend is online, starting delayed online authentication.
   *  (2021-09-30  6:19:03): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x4000): [RID#55] timeout for sending SIGTERM to TGT child [215160] reached.
   *  (2021-09-30  6:19:03): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x0400): [RID#55] Setting 2 seconds timeout for sending SIGKILL to TGT child
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [get_tgt_sigkill_handler] (0x4000): [RID#55] timeout for sending SIGKILL to TGT child [215160] reached.
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [sdap_kinit_done] (0x0080): [RID#55] Communication with KDC timed out, trying the next one
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [_be_fo_set_port_status] (0x8000): [RID#55] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1239
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 389 of server 'ipa01.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 389 of duplicate server 'ipa01.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [sdap_kinit_next_kdc] (0x1000): [RID#55] Resolving next KDC for service IPA
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#55] Trying to resolve service 'IPA'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa02.iad2.fedoraproject.org' is 'name not resolved'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa02.iad2.fedoraproject.org' is 'neutral'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [fo_resolve_service_activate_timeout] (0x2000): [RID#55] Resolve timeout [dns_resolver_timeout] set to 6 seconds
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolve_srv_send] (0x0200): [RID#55] The status of SRV lookup is resolved
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa02.iad2.fedoraproject.org' is 'name not resolved'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_is_address] (0x4000): [RID#55] [ipa02.iad2.fedoraproject.org] does not look like an IP address
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying files
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_gethostbyname_files_send] (0x0100): [RID#55] Trying to resolve A record of 'ipa02.iad2.fedoraproject.org' in files
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa02.iad2.fedoraproject.org' as 'resolving name'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying files
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_gethostbyname_files_send] (0x0100): [RID#55] Trying to resolve AAAA record of 'ipa02.iad2.fedoraproject.org' in files
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_gethostbyname_next] (0x0200): [RID#55] No more address families to retry
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying DNS
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_gethostbyname_dns_query] (0x0100): [RID#55] Trying to resolve A record of 'ipa02.iad2.fedoraproject.org' in DNS
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [schedule_request_timeout] (0x2000): [RID#55] Scheduling a timeout of 3 seconds
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [schedule_timeout_watcher] (0x2000): [RID#55] Scheduling DNS timeout watcher
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [unschedule_timeout_watcher] (0x4000): [RID#55] Unscheduling DNS timeout watcher
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [resolv_gethostbyname_dns_parse] (0x1000): [RID#55] Parsing an A reply
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [request_watch_destructor] (0x0400): [RID#55] Deleting request watch
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa02.iad2.fedoraproject.org' as 'name resolved'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [be_resolve_server_process] (0x0200): [RID#55] Found address for server ipa02.iad2.fedoraproject.org: [10.3.163.55] TTL 300
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [ipa_resolve_callback] (0x0400): [RID#55] Constructed uri 'ldap://ipa02.iad2.fedoraproject.org'
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [sss_krb5_realm_has_proxy] (0x4000): [RID#55] Found KDC Proxy indicator [https://] in [https://id.fedoraproject.org/KdcProxy].
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [write_krb5info_file_contents] (0x0100): [RID#55] KDC Proxy available for realm [FEDORAPROJECT.ORG], no kdcinfo file created.
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [sdap_kinit_kdc_resolved] (0x1000): [RID#55] KDC resolved, attempting to get TGT...
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [create_tgt_req_send_buffer] (0x0400): [RID#55] buffer size: 88
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Setting up signal handler up for pid [215170]
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Signal handler set up for pid [215170]
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [set_tgt_child_timeout] (0x0400): [RID#55] Setting 8 seconds timeout for TGT child
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [write_pipe_handler] (0x0400): [RID#55] All data has been sent!
   *  (2021-09-30  6:19:13): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x4000): [RID#55] timeout for sending SIGTERM to TGT child [215170] reached.
   *  (2021-09-30  6:19:13): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x0400): [RID#55] Setting 2 seconds timeout for sending SIGKILL to TGT child
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [get_tgt_sigkill_handler] (0x4000): [RID#55] timeout for sending SIGKILL to TGT child [215170] reached.
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [sdap_kinit_done] (0x0080): [RID#55] Communication with KDC timed out, trying the next one
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [_be_fo_set_port_status] (0x8000): [RID#55] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1239
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 389 of server 'ipa02.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 389 of duplicate server 'ipa02.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [sdap_kinit_next_kdc] (0x1000): [RID#55] Resolving next KDC for service IPA
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#55] Trying to resolve service 'IPA'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa03.iad2.fedoraproject.org' is 'name not resolved'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa03.iad2.fedoraproject.org' is 'neutral'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [fo_resolve_service_activate_timeout] (0x2000): [RID#55] Resolve timeout [dns_resolver_timeout] set to 6 seconds
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolve_srv_send] (0x0200): [RID#55] The status of SRV lookup is resolved
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa03.iad2.fedoraproject.org' is 'name not resolved'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_is_address] (0x4000): [RID#55] [ipa03.iad2.fedoraproject.org] does not look like an IP address
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying files
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_gethostbyname_files_send] (0x0100): [RID#55] Trying to resolve A record of 'ipa03.iad2.fedoraproject.org' in files
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa03.iad2.fedoraproject.org' as 'resolving name'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying files
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_gethostbyname_files_send] (0x0100): [RID#55] Trying to resolve AAAA record of 'ipa03.iad2.fedoraproject.org' in files
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_gethostbyname_next] (0x0200): [RID#55] No more address families to retry
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_gethostbyname_step] (0x2000): [RID#55] Querying DNS
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_gethostbyname_dns_query] (0x0100): [RID#55] Trying to resolve A record of 'ipa03.iad2.fedoraproject.org' in DNS
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [schedule_request_timeout] (0x2000): [RID#55] Scheduling a timeout of 3 seconds
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [schedule_timeout_watcher] (0x2000): [RID#55] Scheduling DNS timeout watcher
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [unschedule_timeout_watcher] (0x4000): [RID#55] Unscheduling DNS timeout watcher
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [resolv_gethostbyname_dns_parse] (0x1000): [RID#55] Parsing an A reply
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [request_watch_destructor] (0x0400): [RID#55] Deleting request watch
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [set_server_common_status] (0x0100): [RID#55] Marking server 'ipa03.iad2.fedoraproject.org' as 'name resolved'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [be_resolve_server_process] (0x0200): [RID#55] Found address for server ipa03.iad2.fedoraproject.org: [10.3.163.104] TTL 300
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [ipa_resolve_callback] (0x0400): [RID#55] Constructed uri 'ldap://ipa03.iad2.fedoraproject.org'
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [sss_krb5_realm_has_proxy] (0x4000): [RID#55] Found KDC Proxy indicator [https://] in [https://id.fedoraproject.org/KdcProxy].
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [write_krb5info_file_contents] (0x0100): [RID#55] KDC Proxy available for realm [FEDORAPROJECT.ORG], no kdcinfo file created.
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [sdap_kinit_kdc_resolved] (0x1000): [RID#55] KDC resolved, attempting to get TGT...
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [create_tgt_req_send_buffer] (0x0400): [RID#55] buffer size: 88
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Setting up signal handler up for pid [215181]
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Signal handler set up for pid [215181]
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [set_tgt_child_timeout] (0x0400): [RID#55] Setting 8 seconds timeout for TGT child
   *  (2021-09-30  6:19:15): [be[fedoraproject.org]] [write_pipe_handler] (0x0400): [RID#55] All data has been sent!
   *  (2021-09-30  6:19:23): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x4000): [RID#55] timeout for sending SIGTERM to TGT child [215181] reached.
   *  (2021-09-30  6:19:23): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x0400): [RID#55] Setting 2 seconds timeout for sending SIGKILL to TGT child
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [get_tgt_sigkill_handler] (0x4000): [RID#55] timeout for sending SIGKILL to TGT child [215181] reached.
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [sdap_kinit_done] (0x0080): [RID#55] Communication with KDC timed out, trying the next one
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [_be_fo_set_port_status] (0x8000): [RID#55] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1239
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 389 of server 'ipa03.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 389 of duplicate server 'ipa03.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [sdap_kinit_next_kdc] (0x1000): [RID#55] Resolving next KDC for service IPA
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#55] Trying to resolve service 'IPA'
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 0 for server 'ipa01.iad2.fedoraproject.org' is 'neutral'
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [fo_resolve_service_activate_timeout] (0x2000): [RID#55] Resolve timeout [dns_resolver_timeout] set to 6 seconds
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [be_resolve_server_process] (0x0200): [RID#55] Found address for server ipa01.iad2.fedoraproject.org: [10.3.163.54] TTL 300
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [ipa_resolve_callback] (0x0400): [RID#55] Constructed uri 'ldap://ipa01.iad2.fedoraproject.org'
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [sss_krb5_realm_has_proxy] (0x4000): [RID#55] Found KDC Proxy indicator [https://] in [https://id.fedoraproject.org/KdcProxy].
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [write_krb5info_file_contents] (0x0100): [RID#55] KDC Proxy available for realm [FEDORAPROJECT.ORG], no kdcinfo file created.
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [sdap_kinit_kdc_resolved] (0x1000): [RID#55] KDC resolved, attempting to get TGT...
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [create_tgt_req_send_buffer] (0x0400): [RID#55] buffer size: 88
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Setting up signal handler up for pid [215190]
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Signal handler set up for pid [215190]
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [set_tgt_child_timeout] (0x0400): [RID#55] Setting 8 seconds timeout for TGT child
   *  (2021-09-30  6:19:25): [be[fedoraproject.org]] [write_pipe_handler] (0x0400): [RID#55] All data has been sent!
   *  (2021-09-30  6:19:33): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x4000): [RID#55] timeout for sending SIGTERM to TGT child [215190] reached.
   *  (2021-09-30  6:19:33): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x0400): [RID#55] Setting 2 seconds timeout for sending SIGKILL to TGT child
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_tgt_sigkill_handler] (0x4000): [RID#55] timeout for sending SIGKILL to TGT child [215190] reached.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_kinit_done] (0x0080): [RID#55] Communication with KDC timed out, trying the next one
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [_be_fo_set_port_status] (0x8000): [RID#55] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1239
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 0 of server 'ipa01.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 0 of duplicate server 'ipa01.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_kinit_next_kdc] (0x1000): [RID#55] Resolving next KDC for service IPA
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#55] Trying to resolve service 'IPA'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa01.iad2.fedoraproject.org' is 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x0080): [RID#55] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa02.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa02.iad2.fedoraproject.org' is 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x0080): [RID#55] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa03.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa03.iad2.fedoraproject.org' is 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x0080): [RID#55] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 0 for server 'ipa01.iad2.fedoraproject.org' is 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x0080): [RID#55] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0020): [RID#55] No available servers for service 'IPA'
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_cli_connect_recv] (0x0040): [RID#55] Unable to establish connection [13]: Permission denied
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_resolve_server_done] (0x1000): [RID#55] Server [NULL] resolution failed: [5]: Input/output error
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_cli_kinit_done] (0x0400): [RID#55] Cannot get a TGT: ret [1432158230](Network I/O Error)
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_cli_connect_recv] (0x0040): [RID#55] Unable to establish connection [13]: Permission denied
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0020): [RID#55] No available servers for service 'IPA'
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [_be_fo_set_port_status] (0x8000): [RID#55] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 2125
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_set_port_status] (0x0100): [RID#55] Marking port 389 of server 'ipa01.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_set_port_status] (0x0400): [RID#55] Marking port 389 of duplicate server 'ipa01.iad2.fedoraproject.org' as 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_handle_release] (0x2000): [RID#55] Trace: sh[0x555a490fc170], connected[1], ops[(nil)], ldap[0x555a490f5f60], destructor_lock[0], release_memory[0]
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [remove_connection_callback] (0x4000): [RID#55] Successfully removed connection callback.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_op_connect_done] (0x4000): [RID#55] attempting failover retry on op #1
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_op_connect_step] (0x4000): [RID#55] beginning to connect
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#55] Trying to resolve service 'IPA'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa01.iad2.fedoraproject.org' is 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x0080): [RID#55] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa02.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa02.iad2.fedoraproject.org' is 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x0080): [RID#55] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa03.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 389 for server 'ipa03.iad2.fedoraproject.org' is 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x0080): [RID#55] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_server_status] (0x1000): [RID#55] Status of server 'ipa01.iad2.fedoraproject.org' is 'name resolved'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x1000): [RID#55] Port status of port 0 for server 'ipa01.iad2.fedoraproject.org' is 'not working'
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [get_port_status] (0x0080): [RID#55] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [fo_resolve_service_send] (0x0020): [RID#55] No available servers for service 'IPA'
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_op_connect_done] (0x0040): [RID#55] Failed to connect, going offline (5 [Input/output error])
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_op_connect_done] (0x4000): [RID#55] attempting failover retry on op #2
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_op_connect_step] (0x4000): [RID#55] waiting for connection to complete
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_release_conn_data] (0x4000): [RID#55] Releasing unused connection with fd [-1]
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_resolve_server_done] (0x1000): [RID#55] Server [NULL] resolution failed: [5]: Input/output error
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_op_connect_done] (0x0040): [RID#55] Failed to connect, going offline (5 [Input/output error])
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-09-30  6:19:35): [be[fedoraproject.org]] [ipa_subdomains_refresh_connect_done] (0x0020): [RID#55] Unable to connect to LDAP [11]: Resource temporarily unavailable
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_mark_offline] (0x2000): [RID#55] Going offline!
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_mark_offline] (0x2000): [RID#55] Enable check_if_online_ptask.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_ptask_enable] (0x0400): [RID#55] Task [Check if online (periodic)]: enabling task
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_ptask_schedule] (0x0400): [RID#55] Task [Check if online (periodic)]: scheduling task 72 seconds from now [1632982847]
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_run_offline_cb] (0x0080): [RID#55] Going offline. Running callbacks.
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_op_connect_done] (0x4000): [RID#55] notify offline to op #1
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [ipa_subdomains_refresh_connect_done] (0x0020): [RID#55] Unable to connect to LDAP [11]: Resource temporarily unavailable
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-09-30  6:19:35): [be[fedoraproject.org]] [be_ptask_done] (0x0040): [RID#55] Task [Subdomains Refresh]: failed with [1432158212]: SSSD is offline
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [ipa_subdomains_refresh_connect_done] (0x0080): [RID#55] No IPA server is available, cannot get the subdomain list while offline
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_ptask_done] (0x0040): [RID#55] Task [Subdomains Refresh]: failed with [1432158212]: SSSD is offline
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-09-30  6:19:35): [be[fedoraproject.org]] [ipa_sudo_refresh_connect_done] (0x0020): [RID#55] SUDO LDAP connection failed [11]: Resource temporarily unavailable
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [be_ptask_schedule] (0x0400): [RID#55] Task [Subdomains Refresh]: scheduling task 14400 seconds from now [1632997175]
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [sdap_id_op_connect_done] (0x4000): [RID#55] notify offline to op #2
   *  (2021-09-30  6:19:35): [be[fedoraproject.org]] [ipa_sudo_refresh_connect_done] (0x0020): [RID#55] SUDO LDAP connection failed [11]: Resource temporarily unavailable
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-09-30  6:19:35): [be[fedoraproject.org]] [be_ptask_done] (0x0040): [RID#55] Task [SUDO Full Refresh]: failed with [11]: Resource temporarily unavailable

aha, yeah, it's this TGT stuff. Note these lines:

   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_kinit_kdc_resolved] (0x1000): [RID#55] KDC resolved, attempting to get TGT...
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [create_tgt_req_send_buffer] (0x0400): [RID#55] buffer size: 88
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Setting up signal handler up for pid [215160]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#55] Signal handler set up for pid [215160]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [set_tgt_child_timeout] (0x0400): [RID#55] Setting 8 seconds timeout for TGT child
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_process_result] (0x2000): Trace: sh[0x555a490fc170], connected[1], ops[(nil)], ldap[0x555a490f5f60]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [write_pipe_handler] (0x0400): [RID#55] All data has been sent!
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_online_cb] (0x0400): [RID#55] Back end is online
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_enable] (0x0400): [RID#55] Task [SUDO Smart Refresh]: enabling task
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_schedule] (0x0400): [RID#55] Task [SUDO Smart Refresh]: scheduling task 900 seconds from now [1632983635]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_online_cb] (0x0400): [RID#55] Back end is online
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_enable] (0x0400): [RID#55] Task [SUDO Full Refresh]: enabling task
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_schedule] (0x0400): [RID#55] Task [SUDO Full Refresh]: scheduling task 0 seconds from now [1632982735]
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [be_ptask_execute] (0x0400): [RID#55] Task [SUDO Full Refresh]: executing task, timeout 21600 seconds
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [ipa_sudo_full_refresh_send] (0x0400): [RID#55] Issuing a full refresh of sudo rules
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [sdap_id_op_connect_step] (0x4000): [RID#55] waiting for connection to complete
   *  (2021-09-30  6:18:55): [be[fedoraproject.org]] [delayed_online_authentication_callback] (0x0200): [RID#55] Backend is online, starting delayed online authentication.
   *  (2021-09-30  6:19:03): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x4000): [RID#55] timeout for sending SIGTERM to TGT child [215160] reached.
   *  (2021-09-30  6:19:03): [be[fedoraproject.org]] [get_tgt_timeout_handler] (0x0400): [RID#55] Setting 2 seconds timeout for sending SIGKILL to TGT child
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [get_tgt_sigkill_handler] (0x4000): [RID#55] timeout for sending SIGKILL to TGT child [215160] reached.
   *  (2021-09-30  6:19:05): [be[fedoraproject.org]] [sdap_kinit_done] (0x0080): [RID#55] Communication with KDC timed out, trying the next one

pid 215160 is one of the stuck ldap_child processes. So it seems like it tries to get a TGT, fails for some reason, then fails to kill the process, then retries, and just keeps on doing that forever in an eternal loop, all the while piling up unkillable processes, till the system falls over.

I don't know why the TGT request is failing and resulting in a zombie process, though. If I look at the same log file on x86-worker01, which doesn't seem to be hitting this problem, it just works:

   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [sdap_kinit_kdc_resolved] (0x1000): [RID#741] KDC resolved, attempting to get TGT...
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [create_tgt_req_send_buffer] (0x0400): [RID#741] buffer size: 88
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#741] Setting up signal handler up for pid [4078266]
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [child_handler_setup] (0x2000): [RID#741] Signal handler set up for pid [4078266]
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [set_tgt_child_timeout] (0x0400): [RID#741] Setting 8 seconds timeout for TGT child
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [sdap_process_result] (0x2000): Trace: sh[0x55d1e07b16c0], connected[1], ops[(nil)], ldap[0x55d1e07878a0]
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [write_pipe_handler] (0x0400): [RID#741] All data has been sent!
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [read_pipe_handler] (0x0400): [RID#741] EOF received, client finished
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [sdap_get_tgt_recv] (0x0400): [RID#741] Child responded: 0 [FILE:/var/lib/sss/db/ccache_FEDORAPROJECT.ORG], expired on [1632704794]
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [sdap_cli_auth_step] (0x0100): [RID#741] expire timeout is 900
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [sdap_cli_auth_step] (0x1000): [RID#741] the connection will expire at 1632619294
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [sasl_bind_send] (0x0100): [RID#741] Executing sasl bind mech: GSSAPI, user: host/openqa-x86-worker01.iad2.fedoraproject.org
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [child_sig_handler] (0x1000): [RID#741] Waiting for child [4078266].
   *  (2021-09-26  1:06:34): [be[fedoraproject.org]] [child_sig_handler] (0x0100): [RID#741] child [4078266] finished successfully.

@pbrezina any idea what might be going on here? See above few comments - we have some systems in Fedora infra which are repeatedly getting stuck on some problem with requesting a TGT from FreeIPA, which piles up unkillable zombie ldap_child processes till the system keels over...

I can telnet to the FreeIPA / KDC server on ports 389 and 88 just fine, so I don't think it's a connection issue, exactly?

Hmm, still, looking at it another way, the thing sssd is essentially having trouble with there is forking. And it seems like other things have the same problem - running some things (ps, top) works when the system is in this state, but other things (running htop or man, sshing in) do not, and I suspect that's because they get stuck in a similar way. So maybe these are all symptoms of the same problem to do with forking processes, but we don't know what the problem really is yet? This is very weird, though.

If this is reproducible with other processes, can you get strace? What does /var/log/sssd/krb5_child.log and ldap_child.log say?

unfortunately, trying to strace anything is another thing that hangs when the system is in this state :(

I have now downgraded worker04 to a 5.11 kernel, to see if this is a kernel issue somehow.

so worker02 and worker04 on kernel 5.11 have made it a few hours without blowing up. That suggests this may be a kernel issue, so I've filed https://bugzilla.redhat.com/show_bug.cgi?id=2009585 .

sorry to make you do the unnecessary redeployment, @mobrien ! I'll close this issue as it looks like it's probably not an infra problem at this point.

Metadata Update from @adamwill:
- Issue close_status updated to: Will Not/Can Not fix
- Issue status updated to: Closed (was: Open)

2 years ago

Issue status updated to: Open (was: Closed)

2 years ago

so worker02 and worker04 on kernel 5.11 have made it a few hours without blowing up. That suggests this may be a kernel issue, so I've filed https://bugzilla.redhat.com/show_bug.cgi?id=2009585 .

sorry to make you do the unnecessary redeployment, @mobrien ! I'll close this issue as it looks like it's probably not an infra problem at this point.

No problem at all Adam, glad it helped lead the way to finding the cause of the problem

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Login to comment on this ticket.

Metadata
Boards 1
ops Status: Backlog