In my attempts to "migrate" our RHEL 9 systemd job to C9S in CentOS CI I came across a strange issue: after installing a freshly built systemd (RHEL 9 systemd) and rebooting, the node either won't come back online or dies shortly afterwards, but only if it's one of the *.dusty nodes:
00:11:49.365 2021-10-25 18:52:50,240 [agent-control/wait_for_node] INFO: Checking if the node n12.dusty is alive (try #1)
00:11:49.432 2021-10-25 18:52:50,241 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n12.dusty
00:11:49.432 ERROR: ld.so: object 'libnss_wrapper.so' from LD_PRELOAD cannot be preloaded: ignored.
00:11:49.432 PING n12.dusty.ci.centos.org (172.19.2.76) 56(84) bytes of data.
00:11:49.432
00:11:49.432 --- n12.dusty.ci.centos.org ping statistics ---
00:11:49.432 1 packets transmitted, 1 received, 0% packet loss, time 0ms
00:11:49.432 rtt min/avg/max/mdev = 1843.871/1843.871/1843.871/0.000 ms
00:11:49.432 2021-10-25 18:52:52,093 [agent-control/wait_for_node] INFO: Node n12.dusty appears reachable again
00:11:49.433 2021-10-25 18:53:22,122 [agent-control/<module>] INFO: PHASE 3: Upstream testsuite
00:11:49.433 2021-10-25 18:53:22,123 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n12.dusty': systemd-centos-ci/agent/testsuite-rhel9.sh
00:11:49.433 2021-10-25 18:53:22,123 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n12.dusty systemd-centos-ci/agent/testsuite-rhel9.sh
00:11:49.433 Pseudo-terminal will not be allocated because stdin is not a terminal.
00:11:49.442 Warning: Permanently added 'n12.dusty,172.19.2.76' (ECDSA) to the list of known hosts.
00:11:49.486 [TASK-CONTROL] OPTIMAL_QEMU_SMP = 4
00:11:49.486 [TASK-CONTROL] MAX_QUEUE_SIZE = 2
00:11:49.544 Current cgroup hierarchy: unified
00:11:49.634 No coredumps found.
00:11:49.779 ~/systemd ~
00:11:49.821 [TASK] ninja-test (meson test -C build --print-errorlogs --timeout-multiplier=3 --num-processes 4)
00:11:49.830 Waiting for PID 5331 to finish
00:11:49.830 ...........Timeout, server n12.dusty not responding.
00:13:19.469 2021-10-26 18:43:27,764 [agent-control/wait_for_node] INFO: Checking if the node n20.dusty is alive (try #9)
00:13:19.469 2021-10-26 18:43:27,764 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n20.dusty
00:13:19.470 ERROR: ld.so: object 'libnss_wrapper.so' from LD_PRELOAD cannot be preloaded: ignored.
00:13:22.564 PING n20.dusty.ci.centos.org (172.19.2.84) 56(84) bytes of data.
00:13:22.564
00:13:22.564 --- n20.dusty.ci.centos.org ping statistics ---
00:13:22.564 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
00:13:22.564
00:13:37.581 2021-10-26 18:43:45,877 [agent-control/<module>] ERROR: Execution failed
00:13:37.581 Traceback (most recent call last):
00:13:37.581   File "./agent-control.py", line 523, in <module>
00:13:37.581     ac.reboot_node(node)
00:13:37.581   File "./agent-control.py", line 332, in reboot_node
00:13:37.581     self.wait_for_node(node, 10)
00:13:37.581   File "./agent-control.py", line 347, in wait_for_node
00:13:37.581     raise Exception("Timeout reached when waiting for node to become online")
00:13:37.581 Exception: Timeout reached when waiting for node to become online
2021-10-26 19:50:08,982 [agent-control/wait_for_node] INFO: Node n18.dusty appears reachable again
2021-10-26 19:50:38,995 [agent-control/<module>] INFO: PHASE 3: Upstream testsuite
2021-10-26 19:50:38,996 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n18.dusty': systemd-centos-ci/agent/testsuite-rhel9.sh
2021-10-26 19:50:38,996 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 h
Warning: Permanently added 'n18.dusty,172.19.2.82' (ECDSA) to the list of known hosts.
[TASK-CONTROL] OPTIMAL_QEMU_SMP = 4
[TASK-CONTROL] MAX_QUEUE_SIZE = 2
Current cgroup hierarchy: unified
No coredumps found.
~/systemd ~
[TASK] ninja-test (meson test -C build --print-errorlogs --timeout-multiplier=3 --num-processes 4)
Waiting for PID 5332 to finish
............Timeout, server n18.dusty not responding.
I haven't encountered the issue on any other chassis (at least not so far).
I wish I could provide more information, but once the issue is triggered the machine simply becomes unreachable.
$ while ! ssh n18.dusty; do sleep 5; done
ssh: connect to host n18.dusty port 22: No route to host
ssh: connect to host n18.dusty port 22: No route to host
ssh: connect to host n18.dusty port 22: No route to host
ssh: connect to host n18.dusty port 22: No route to host
...
Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, low-gain, medium-trouble, need-more-info
Is this issue still affecting you?
Unfortunately, yes. I wish I could debug it more, but without VNC/KVM it's basically impossible.
I tried several ugly hacks to resurrect the machine when it crashes(?), but to no avail. Also, given there's no filtering mechanism for which machine I get from Duffy, debugging this is quite wasteful, since I have to throw away all machines that are not from the dusty chassis (on average it takes five non-dusty machines to get one dusty one).
Anyway, I have a reliable reproducer, so I could definitely break one such machine and then send you the hostname so you could take a closer look (IIRC there should be a serial console for each chassis, but it's kind of a PITA to get to it). If that sounds feasible, we can definitely coordinate either here or on IRC.
After a couple more tests I managed to isolate the problematic test:
[root@n9 systemd]# build/test-watchdog
Using hardware watchdog 'iTCO_wdt', version 0, device /dev/watchdog
Set hardware watchdog to 10s.
Pinging...
Pinging...
Pinging...
After the third ping the machine simply stops responding.
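For context, the test essentially just exercises the kernel's generic watchdog chardev API: set a timeout, then kick the device periodically. A minimal standalone sketch of such a ping loop (my own illustration, not systemd's actual test code) would look roughly like this:

/* Illustrative sketch only: open the watchdog device, set a 10s timeout,
 * kick it every 5s, then disarm it via the "magic close". */
#include <fcntl.h>
#include <linux/watchdog.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/watchdog", O_WRONLY | O_CLOEXEC);
    if (fd < 0) {
        perror("open(/dev/watchdog)");
        return 1;
    }

    int timeout = 10;
    if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) < 0)
        perror("WDIOC_SETTIMEOUT");
    printf("Watchdog timeout set to %d s\n", timeout);

    for (int i = 0; i < 3; i++) {
        printf("Pinging...\n");
        if (ioctl(fd, WDIOC_KEEPALIVE, NULL) < 0)
            perror("WDIOC_KEEPALIVE");
        sleep(5);
    }

    /* "Magic close": write 'V' before closing so the driver disarms the
     * timer on close (unless it was loaded with nowayout). */
    if (write(fd, "V", 1) < 0)
        perror("magic close");
    close(fd);
    return 0;
}

On the affected nodes the hang appears while merely interacting with the iTCO device like this, i.e. well before any timeout could plausibly expire.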
Reproducer without the need to build systemd:
# cat watchdog.conf
watchdog-device = /dev/watchdog
watchdog-timeout = 10
interval = 5
verbose = 2

# watchdog -X 5 --foreground --verbose --config-file watchdog.conf
watchdog: NOTE: Using --loop-exit so daemon will exit after 5 time intervals
watchdog: String 'watchdog-device' found as '/dev/watchdog'
watchdog: Integer 'watchdog-timeout' found = 10
watchdog: Integer 'interval' found = 5
watchdog: Integer 'verbose' found = 2
watchdog: starting daemon (5.16):
watchdog: int=5s realtime=no sync=no load=0,0,0 soft=no
watchdog: memory not checked
watchdog: ping: no machine to check
watchdog: file: no file to check
watchdog: pidfile: no server process to check
watchdog: interface: no interface to check
watchdog: temperature: no sensors to check
watchdog: no test binary files
watchdog: no repair binary files
watchdog: error retry time-out = 60 seconds
watchdog: repair attempts = 1
watchdog: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
watchdog: Warning: watchdog now set to 10 seconds, should be more than double interval = 5
watchdog: hardware watchdog identity: iTCO_wdt
watchdog: Time from last refresh = 0.000020280 => 0
watchdog: Time from last refresh = 0.000028507 => 0
watchdog: Time from last refresh = 0.000040057 => 0
watchdog: Time from last refresh = 0.000046834 => 0
watchdog: Time from last refresh = 0.000053021 => 0
watchdog: Time from last refresh = 0.000058933 => 0
watchdog: still alive after 1 interval(s)
watchdog: Time from last refresh = 5.000221931 => 1
watchdog: Time from last refresh = 0.000026791 => 0
watchdog: Time from last refresh = 0.000049091 => 0
watchdog: Time from last refresh = 0.000058584 => 0
watchdog: Time from last refresh = 0.000066100 => 0
watchdog: Time from last refresh = 0.000073008 => 0
watchdog: still alive after 2 interval(s)
watchdog: Time from last refresh = 5.000234592 => 1
watchdog: Time from last refresh = 0.000025091 => 0
watchdog: Time from last refresh = 0.000046542 => 0
watchdog: Time from last refresh = 0.000056016 => 0
watchdog: Time from last refresh = 0.000063417 => 0
watchdog: Time from last refresh = 0.000070468 => 0
<machine stops responding>
I see: looking at past results, only the dusty nodes have the HW watchdog enabled by default, and that's the case only on C9S (at least from what I've managed to find out from the logs so far). That's why I encountered it only now.
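(For anyone checking other nodes: one quick way to see which driver actually backs the watchdog device is to read its sysfs identity; a minimal sketch, assuming the kernel registered the device as watchdog0:)

/* Print which driver backs the default watchdog device; assumes it is
 * registered as watchdog0 (adjust the path if there are several devices). */
#include <stdio.h>

int main(void) {
    char identity[128];
    FILE *f = fopen("/sys/class/watchdog/watchdog0/identity", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fgets(identity, sizeof(identity), f))
        /* e.g. "iTCO_wdt" on the dusty nodes, "Software Watchdog" with softdog */
        fputs(identity, stdout);
    fclose(f);
    return 0;
}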
However, I can't trigger the issue using the SW watchdog (modprobe softdog):
modprobe softdog
# watchdog -X 5 --foreground --verbose --config-file watchdog.conf
watchdog: NOTE: Using --loop-exit so daemon will exit after 5 time intervals
watchdog: String 'watchdog-device' found as '/dev/watchdog'
watchdog: Integer 'watchdog-timeout' found = 10
watchdog: Integer 'interval' found = 5
watchdog: Integer 'verbose' found = 2
watchdog: starting daemon (5.16):
watchdog: int=5s realtime=no sync=no load=0,0,0 soft=no
watchdog: memory not checked
watchdog: ping: no machine to check
watchdog: file: no file to check
watchdog: pidfile: no server process to check
watchdog: interface: no interface to check
watchdog: temperature: no sensors to check
watchdog: no test binary files
watchdog: no repair binary files
watchdog: error retry time-out = 60 seconds
watchdog: repair attempts = 1
watchdog: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
watchdog: Warning: watchdog now set to 10 seconds, should be more than double interval = 5
watchdog: hardware watchdog identity: Software Watchdog
watchdog: Time from last refresh = 0.000030952 => 0
watchdog: Time from last refresh = 0.000047378 => 0
watchdog: Time from last refresh = 0.000070687 => 0
watchdog: Time from last refresh = 0.000084551 => 0
watchdog: Time from last refresh = 0.000097637 => 0
watchdog: Time from last refresh = 0.000110945 => 0
watchdog: still alive after 1 interval(s)
watchdog: Time from last refresh = 5.000312826 => 1
watchdog: Time from last refresh = 0.000031203 => 0
watchdog: Time from last refresh = 0.000065960 => 0
watchdog: Time from last refresh = 0.000084680 => 0
watchdog: Time from last refresh = 0.000102911 => 0
watchdog: Time from last refresh = 0.000121451 => 0
watchdog: still alive after 2 interval(s)
watchdog: Time from last refresh = 5.000319324 => 1
watchdog: Time from last refresh = 0.000024510 => 0
watchdog: Time from last refresh = 0.000055251 => 0
watchdog: Time from last refresh = 0.000070385 => 0
watchdog: Time from last refresh = 0.000100537 => 0
watchdog: Time from last refresh = 0.000116801 => 0
watchdog: still alive after 3 interval(s)
watchdog: Time from last refresh = 5.000356395 => 1
watchdog: Time from last refresh = 0.000015749 => 0
watchdog: Time from last refresh = 0.000046203 => 0
watchdog: Time from last refresh = 0.000061662 => 0
watchdog: Time from last refresh = 0.000076055 => 0
watchdog: Time from last refresh = 0.000090371 => 0
watchdog: still alive after 4 interval(s)
watchdog: Time from last refresh = 5.000300085 => 1
watchdog: Time from last refresh = 0.000015585 => 0
watchdog: Time from last refresh = 0.000045864 => 0
watchdog: Time from last refresh = 0.000061623 => 0
watchdog: Time from last refresh = 0.000075198 => 0
watchdog: Time from last refresh = 0.000088652 => 0
watchdog: still alive after 5 interval(s)
watchdog: loop exit on interval counter reached
watchdog: stopping daemon (5.16)
I also tested the iTCO watchdog with a 20s timeout & 10s interval and the result is the same (the machine stops responding).
I know the watchdog utility complains that the timeout should be more than double the interval, but the machine is not under any load, so the current timeout should be plenty (as shown by the SW watchdog run). Also, the machine should come back after being rebooted by the watchdog, but instead it seems to get stuck in some state unknown to me.
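If a node did eventually come back, one could at least check whether the reboot was really triggered by the watchdog via the boot status that the watchdog core exposes; a minimal sketch, assuming the device is watchdog0 and that the driver reports the reset cause at all (many simply report 0):

/* Check whether the last reset was flagged as a watchdog reset. Reads the
 * WDIOF_* bitmask exposed by the watchdog core via sysfs. */
#include <linux/watchdog.h>
#include <stdio.h>

int main(void) {
    unsigned int bootstatus = 0;
    FILE *f = fopen("/sys/class/watchdog/watchdog0/bootstatus", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%u", &bootstatus) == 1)
        printf("bootstatus=0x%x (%s)\n", bootstatus,
               bootstatus & WDIOF_CARDRESET ? "watchdog reset"
                                            : "no watchdog reset recorded");
    fclose(f);
    return 0;
}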
I disabled the problematic test case for now, so it's no longer a blocking issue, but there's definitely something fishy going on.
@mrc0mmand happy to debug that with you when I'm back from PTO (still one week to go), so I'll reach out once I'm back and we can "reserve" one dusty node to see what's wrong
Ah, interesting, upon further investigation this looks like a bug in the iTCO_wdt kernel module:
https://bugzilla.kernel.org/show_bug.cgi?id=213809
which was fixed about a month ago by reverting the offending commits:
https://patchwork.kernel.org/project/linux-watchdog/patch/20211008003302.1461733-1-linux@roeck-us.net/
but it seems the fix hasn't made it into the C9S kernel yet.
Filed to RHBZ as https://bugzilla.redhat.com/show_bug.cgi?id=2020918.
@arrfab I created an MR as discussed on IRC (https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/177) to move things forward.
@mrc0mmand ack ... I'd be tempted to close this ticket as it's now more of a software issue, but let's keep it open; once you see the patch land in Stream, you'll have a way to verify yourself that it works fine and can then close the ticket :)
Finally got around to testing this again, and the current C9S snapshots indeed contain a fixed kernel, so the tests pass once again without killing the *.dusty machines.
Metadata Update from @mrc0mmand:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)