#4132 sssd_be: Busy loops on flaky LDAP, SIGTERM from watchdog not processed
Closed: duplicate a year ago by natureshadow. Opened a year ago by natureshadow.

Forwarded from the Debian BTS:

In a setup where sssd uses a remote slapd for NSS over a somewhat flaky network, sssd_be sometimes gets into a busy loop, using 100% CPU time on one core.

Debugging showed that sssd has a watchdog to clean up in such cases, but sssd_be installs a signal handler that prevents the SIGTERM sent to the process group from being processed correctly, so the process does not exit.

src/util/util_watchdog.c:

     64 /* the watchdog is purposefully *not* handled by the tevent
     65  * signal handler as it is meant to check if the daemon is
     66  * still processing the event queue itself. A stuck process
     67  * may not handle the event queue at all and thus not handle
     68  * signals either */
     69 static void watchdog_handler(int sig)
     70 {
     71
     72     watchdog_detect_timeshift();
     73
     74     /* if a pre-defined number of ticks passed by kills itself */
     75         if (__sync_add_and_fetch(&watchdog_ctx.ticks, 1) > WATCHDOG_MAX_TICKS) {
     76         if (getpid() == getpgrp()) {
     77             kill(-getpgrp(), SIGTERM);
     78         } else {
     79             _exit(1);
     80         }
     81     }
     82 }

(NB: It seems what the comment describes was not entirely successful. ;)

The signal handler is installed in src/providers/data_provider_be.c:

    448 static void be_process_finalize(struct tevent_context *ev,
    449                                 struct tevent_signal *se,
    450                                 int signum,
    451                                 int count,
    452                                 void *siginfo,
    453                                 void *private_data)
    454 {
    455     struct be_ctx *be_ctx;
    456
    457     be_ctx = talloc_get_type(private_data, struct be_ctx);
    458     talloc_free(be_ctx);
    459     orderly_shutdown(0);
    460 }
    461
    462 static errno_t be_process_install_sigterm_handler(struct be_ctx *be_ctx)
    463 {
    464     struct tevent_signal *sige;
    465
    466     BlockSignals(false, SIGTERM);
    467
    468     sige = tevent_add_signal(be_ctx->ev, be_ctx, SIGTERM, SA_SIGINFO,
    469                              be_process_finalize, be_ctx);
    470     if (sige == NULL) {
    471         DEBUG(SSSDBG_CRIT_FAILURE, "tevent_add_signal failed.\n");
    472         return ENOMEM;
    473     }
    474
    475     return EOK;
    476 }

Setting a breakpoint on be_process_finalize showed that this function is never reached, probably because libtevent never gets a chance to call it: the stuck process never returns to the event loop that would dispatch the handler.

Two proposals to circumvent this are:
a) Reset the SIGTERM handler to its default disposition (e.g. signal(SIGTERM, SIG_DFL);) before calling kill() on the process group in line 77
b) Move the _exit() call in line 79 out of the else branch so it is reached unconditionally, in case kill() fails to terminate the process itself

We tested solution a) in gdb and it caused sssd_be to exit cleanly and restart, as it should.


Hi @natureshadow,

could you please check if recently merged https://github.com/SSSD/sssd/pull/964 fixes your issue?

This is actually item (b) of your proposal.

Indeed seems like this is a duplicate of #4089.

Metadata Update from @natureshadow:
- Issue close_status updated to: duplicate
- Issue status updated to: Closed (was: Open)

a year ago

Hi @atikhonov,

it’s not as if the issue is reliably triggered. It just happens that sometimes sssd appears to stop working and uses a full CPU core; with the fix, it would just exit properly.

I assume the fix is correct, and the package maintainers will want to backport it.

Thanks!

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in SSSD's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/5093

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the Subscribe button.

Thank you for your understanding. We apologize for any inconvenience.
