From f24325cf8505e810329178c7c4575ede4f256a7f Mon Sep 17 00:00:00 2001
From: David Teigland
Date: Aug 08 2012 21:58:37 +0000
Subject: wdmd: preemptive close before test fails

Instead of closing the device when a test fails, close it
TEST_INTERVAL (10 sec) before the test fails. This is done so
that the watchdog will fire at most 60 sec after the expire time
(between 50 and 60 seconds instead of between 60 and 70 seconds
which would be the case if we close at the expiration time;
see previous commit).

The timeouts in sanlock have been based on the assumption that
the watchdog device fires at most 60 seconds after the
expiration time, so it's best to maintain that expectation.

The pre-emptive close and re-open generate pings, so they are
used in place of ordinary pings. If the expire time is at T45,
and is renewed/extended at T46, then the sequence of pings
would be:

T10 - ping from ioctl
T20 - ping from ioctl
T30 - ping from ioctl
T40 - ping from close
T50 - ping from re-open
T60 - ping from ioctl
...

If the expire time was *not* renewed, then the watchdog would
fire at T100, which is 55 seconds after the expiration time.
55 is within the desired 50-60 second interval.
Signed-off-by: David Teigland
---

diff --git a/src/timeouts.h b/src/timeouts.h
index 92034b3..f62bb6f 100644
--- a/src/timeouts.h
+++ b/src/timeouts.h
@@ -58,14 +58,17 @@
  *
  * 100: sanlock fails to renew host_id on disk -> no wdmd_test_live
  *      wdmd test_client sees now 100 < expire 120 ok -> keepalive
+ *      messages: check_our_lease warning (sanlock)
  *
  * 110: sanlock fails to renew host_id on disk -> no wdmd_test_live
- *      wdmd test_client sees now 110 < expire 120 ok -> keepalive
+ *      wdmd test_client sees now 110 < expire 120 ok -> keepalive (from dev close)
+ *      messages: watchdog closed unclean (wdmd), test warning (wdmd)
  *
  * 120: sanlock fails to renew host_id on disk -> no wdmd_test_live
- *      sanlock enters recovery mode and starts killing pids
+ *      sanlock enters recovery mode and starts killing pids because we have reached
+ *      now (120) is id_renewal_fail_seconds (80) after last renewal (40)
  *      wdmd test_client sees now 120 >= expire 120 fail -> no keepalive
- *      wdmd starts logging error messages every 10 sec
+ *      messages: check_our_lease failed (sanlock), test failed (wdmd)
  *
  * . /dev/watchdog will fire at last keepalive + watchdog_fire_timeout =
  *   T110 + 60 = T170
diff --git a/wdmd/main.c b/wdmd/main.c
index eafbf03..e289f44 100644
--- a/wdmd/main.c
+++ b/wdmd/main.c
@@ -407,6 +407,38 @@ static int test_clients(void)
 				  (unsigned long long)client[i].expire,
 				  client[i].name);
 			fail_count++;
+			continue;
+		}
+
+		/*
+		 * If we can patch the kernel to avoid a close-ping,
+		 * then we can remove this early/preemptive fail/close
+		 * of the device, but instead just not pet the device
+		 * when the expiration time is reached. Also see
+		 * close_watchdog_unclean() below.
+		 *
+		 * We do this fail/close (which generates a ping)
+		 * TEST_INTERVAL before the expire time because we want
+		 * the device to fire at most 60 seconds after the
+		 * expiration time. That means we need the last ping
+		 * (from close) to be TEST_INTERVAL before to the
+		 * expiration time.
+		 *
+		 * If we did the close at/after the expiration time,
+		 * then the ping from the close would mean that the
+		 * device would fire between 60 and 70 seconds after the
+		 * expiration time.
+		 */
+
+		if (t >= client[i].expire - DEFAULT_TEST_INTERVAL) {
+			log_error("test warning pid %d now %llu keepalive %llu renewal %llu expire %llu",
+				  client[i].pid,
+				  (unsigned long long)t,
+				  (unsigned long long)last_keepalive,
+				  (unsigned long long)client[i].renewal,
+				  (unsigned long long)client[i].expire);
+			fail_count++;
+			continue;
 		}
 	}
 
@@ -890,7 +922,7 @@ static int test_loop(void)
 			/* If we can patch the kernel so that close
 			   does not generate a ping, then we can skip
 			   this close, and just not pet the device in
-			   this case. */
+			   this case. Also see test_client above. */
 			close_watchdog_unclean();
 		}
 	}
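
The timing argument in the commit message can be checked with a small stand-alone sketch. It is not part of wdmd: the function names, the `margin` parameter, and the assumption that tests run at exact multiples of TEST_INTERVAL starting at T10 are illustrative only. The two constants use the 10 sec and 60 sec values quoted in the message.

```c
/* Illustrative constants; values are the ones quoted in the commit message. */
#define TEST_INTERVAL 10	/* seconds between wdmd tests */
#define FIRE_TIMEOUT  60	/* seconds from the last ping until the device fires */

/*
 * Time of the last ping when the expire time is never renewed.
 * Assumes tests run at T10, T20, T30, ...  The device is closed (which
 * pings it) at the first test time t with t >= expire - margin.
 * margin == TEST_INTERVAL models the preemptive close from this patch,
 * margin == 0 models closing at/after the expire time.
 */
static long last_ping(long expire, long margin)
{
	long t = TEST_INTERVAL;

	while (t < expire - margin)
		t += TEST_INTERVAL;
	return t;
}

/* Seconds between the expire time and the moment the watchdog fires. */
static long fire_delay(long expire, long margin)
{
	return last_ping(expire, margin) + FIRE_TIMEOUT - expire;
}
```

With the expire time at T45, `fire_delay(45, TEST_INTERVAL)` gives 55 (last ping at T40, fire at T100), while `fire_delay(45, 0)` gives 65 (last ping at T50, fire at T110). For the timeouts.h example with expire 120, the preemptive close gives a last ping at T110 and a delay of 50.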