#505 CentOS Stream 9 and *.dusty nodes
Closed: Fixed 3 years ago by mrc0mmand. Opened 3 years ago by mrc0mmand.

In my attempts to "migrate" our RHEL 9 systemd job in CentOS CI over to C9S I came across a strange issue: after installing a freshly built systemd (RHEL 9 systemd) and rebooting, the node either won't come back online or dies shortly afterwards, but only if it's one of the *.dusty nodes:

00:11:49.365 2021-10-25 18:52:50,240 [agent-control/wait_for_node] INFO: Checking if the node n12.dusty is alive (try #1)
00:11:49.432 2021-10-25 18:52:50,241 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n12.dusty
00:11:49.432 ERROR: ld.so: object 'libnss_wrapper.so' from LD_PRELOAD cannot be preloaded: ignored.
00:11:49.432 PING n12.dusty.ci.centos.org (172.19.2.76) 56(84) bytes of data.
00:11:49.432 
00:11:49.432 --- n12.dusty.ci.centos.org ping statistics ---
00:11:49.432 1 packets transmitted, 1 received, 0% packet loss, time 0ms
00:11:49.432 rtt min/avg/max/mdev = 1843.871/1843.871/1843.871/0.000 ms
00:11:49.432 2021-10-25 18:52:52,093 [agent-control/wait_for_node] INFO: Node n12.dusty appears reachable again
00:11:49.433 2021-10-25 18:53:22,122 [agent-control/<module>] INFO: PHASE 3: Upstream testsuite
00:11:49.433 2021-10-25 18:53:22,123 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node'n12.dusty': systemd-centos-ci/agent/testsuite-rhel9.sh
00:11:49.433 2021-10-25 18:53:22,123 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n12.dusty systemd-centos-ci/agent/testsuite-rhel9.sh
00:11:49.433 Pseudo-terminal will not be allocated because stdin is not a terminal.
00:11:49.442 Warning: Permanently added 'n12.dusty,172.19.2.76' (ECDSA) to the list of known hosts.
00:11:49.486 [TASK-CONTROL] OPTIMAL_QEMU_SMP = 4
00:11:49.486 [TASK-CONTROL] MAX_QUEUE_SIZE = 2
00:11:49.544 Current cgroup hierarchy: unified
00:11:49.634 No coredumps found.
00:11:49.779 ~/systemd ~
00:11:49.821 [TASK] ninja-test (meson test -C build --print-errorlogs --timeout-multiplier=3 --num-processes 4)
00:11:49.830 Waiting for PID 5331 to finish
00:11:49.830 ...........Timeout, server n12.dusty not responding.
00:13:19.469 2021-10-26 18:43:27,764 [agent-control/wait_for_node] INFO: Checking if the node n20.dusty is alive (try #9)
00:13:19.469 2021-10-26 18:43:27,764 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n20.dusty
00:13:19.470 ERROR: ld.so: object 'libnss_wrapper.so' from LD_PRELOAD cannot be preloaded: ignored.
00:13:22.564 PING n20.dusty.ci.centos.org (172.19.2.84) 56(84) bytes of data.
00:13:22.564 
00:13:22.564 --- n20.dusty.ci.centos.org ping statistics ---
00:13:22.564 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
00:13:22.564 
00:13:37.581 2021-10-26 18:43:45,877 [agent-control/<module>] ERROR: Execution failed
00:13:37.581 Traceback (most recent call last):
00:13:37.581   File "./agent-control.py", line 523, in <module>
00:13:37.581     ac.reboot_node(node)
00:13:37.581   File "./agent-control.py", line 332, in reboot_node
00:13:37.581     self.wait_for_node(node, 10)
00:13:37.581   File "./agent-control.py", line 347, in wait_for_node
00:13:37.581     raise Exception("Timeout reached when waiting for node to become online")
00:13:37.581 Exception: Timeout reached when waiting for node to become online
2021-10-26 19:50:08,982 [agent-control/wait_for_node] INFO: Node n18.dusty appears reachable again
2021-10-26 19:50:38,995 [agent-control/<module>] INFO: PHASE 3: Upstream testsuite
2021-10-26 19:50:38,996 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node'n18.dusty': systemd-centos-ci/agent/testsuite-rhel9.sh
2021-10-26 19:50:38,996 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 h
Warning: Permanently added 'n18.dusty,172.19.2.82' (ECDSA) to the list of known hosts.


[TASK-CONTROL] OPTIMAL_QEMU_SMP = 4
[TASK-CONTROL] MAX_QUEUE_SIZE = 2
Current cgroup hierarchy: unified
No coredumps found.
~/systemd ~
[TASK] ninja-test (meson test -C build --print-errorlogs --timeout-multiplier=3 --num-processes 4)
Waiting for PID 5332 to finish
............Timeout, server n18.dusty not responding.

I haven't encountered the issue on any other chassis (at least so far).

I wish I could provide more information, but once the issue is triggered the machine is simply no longer reachable.

$ while ! ssh n18.dusty; do sleep 5; done
ssh: connect to host n18.dusty port 22: No route to host
ssh: connect to host n18.dusty port 22: No route to host
ssh: connect to host n18.dusty port 22: No route to host
ssh: connect to host n18.dusty port 22: No route to host
...
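
For context, the "Timeout reached" exception in the traceback above comes from agent-control's wait_for_node() retry loop. A rough shell equivalent of what it appears to do, based only on the log and traceback (the ping options and the retry count of 10 match the log; the delay between attempts is a guess):

node=n18.dusty
tries=0
until ping -q -c 1 -W 10 "$node" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge 10 ]; then
        echo "Timeout reached when waiting for node to become online" >&2
        exit 1
    fi
    sleep 15   # guessed delay between attempts
done
echo "Node $node appears reachable again"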

Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, low-gain, medium-trouble, need-more-info

3 years ago

Is this issue still affecting you?

Unfortunately, yes. I wish I could debug it more, but without VNC/KVM it's basically impossible.

I tried several ugly hacks to recover the machine when it crashes(?), but to no avail. Also, given there's no filtering mechanism for which machine I get from Duffy, debugging this is quite wasteful, since I have to throw away all machines that are not from the dusty chassis (on average it takes five non-dusty machines to get one dusty one).

Anyway, I have a reliable reproducer, so I could definitely break one such machine and then send you the hostname so you could take a closer look (IIRC there should be a serial console for each chassis, but it's a bit of a PITA to get to). If that sounds feasible, we can definitely coordinate either here or on IRC.

After a couple more tests I managed to isolate the problematic test:

[root@n9 systemd]# build/test-watchdog
Using hardware watchdog 'iTCO_wdt', version 0, device /dev/watchdog
Set hardware watchdog to 10s.
Pinging...
Pinging...
Pinging...

After the third ping the machine simply stops responding.
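
For the record, what the test boils down to is the standard /dev/watchdog interface: open the device (which arms the watchdog), set a timeout, and write to it periodically to keep it from firing. A rough shell approximation (not systemd's actual test code; it skips the WDIOC_SETTIMEOUT ioctl and relies on the driver's default timeout) which, on an affected dusty node, will most likely hang the machine just the same:

exec 3>/dev/watchdog   # opening the device arms the watchdog
for i in 1 2 3; do
    echo ping >&3      # any write refreshes ("pings") the watchdog
    sleep 5
done
printf 'V' >&3         # magic-close character, disarms the watchdog on close
exec 3>&-              # (unless the driver/kernel is built with NOWAYOUT)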

Reproducer without the need to build systemd:

# cat watchdog.conf 
watchdog-device = /dev/watchdog
watchdog-timeout = 10
interval = 5
verbose = 2
# watchdog -X 5 --foreground  --verbose --config-file watchdog.conf 
watchdog: NOTE: Using --loop-exit so daemon will exit after 5 time intervals
watchdog: String 'watchdog-device' found as '/dev/watchdog'
watchdog: Integer 'watchdog-timeout' found = 10
watchdog: Integer 'interval' found = 5
watchdog: Integer 'verbose' found = 2
watchdog: starting daemon (5.16):
watchdog:  int=5s realtime=no sync=no load=0,0,0 soft=no
watchdog:  memory not checked
watchdog:  ping: no machine to check
watchdog:  file: no file to check
watchdog:  pidfile: no server process to check
watchdog:  interface: no interface to check
watchdog:  temperature: no sensors to check
watchdog:  no test binary files
watchdog:  no repair binary files
watchdog:  error retry time-out = 60 seconds
watchdog:  repair attempts = 1
watchdog:  alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
watchdog: Warning: watchdog now set to 10 seconds, should be more than double interval = 5
watchdog: hardware watchdog identity: iTCO_wdt
watchdog: Time from last refresh = 0.000020280 => 0
watchdog: Time from last refresh = 0.000028507 => 0
watchdog: Time from last refresh = 0.000040057 => 0
watchdog: Time from last refresh = 0.000046834 => 0
watchdog: Time from last refresh = 0.000053021 => 0
watchdog: Time from last refresh = 0.000058933 => 0
watchdog: still alive after 1 interval(s)
watchdog: Time from last refresh = 5.000221931 => 1
watchdog: Time from last refresh = 0.000026791 => 0
watchdog: Time from last refresh = 0.000049091 => 0
watchdog: Time from last refresh = 0.000058584 => 0
watchdog: Time from last refresh = 0.000066100 => 0
watchdog: Time from last refresh = 0.000073008 => 0
watchdog: still alive after 2 interval(s)
watchdog: Time from last refresh = 5.000234592 => 1
watchdog: Time from last refresh = 0.000025091 => 0
watchdog: Time from last refresh = 0.000046542 => 0
watchdog: Time from last refresh = 0.000056016 => 0
watchdog: Time from last refresh = 0.000063417 => 0
watchdog: Time from last refresh = 0.000070468 => 0
<machine stops responding>

Looking at the past results, I see that only the dusty nodes have the HW watchdog enabled by default, and only on C9S (at least from what I've managed to find out from the logs so far). That's why I encountered this only now.
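
For anyone checking other chassis: a quick way to see which (if any) watchdog driver is bound on a freshly provisioned node (assuming a reasonably recent kernel with the sysfs watchdog class and util-linux installed):

lsmod | grep -E 'iTCO_wdt|softdog'
cat /sys/class/watchdog/watchdog0/identity 2>/dev/null   # reads "iTCO_wdt" on the dusty nodes
wdctl /dev/watchdog                                      # shows identity, timeout and flags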

However, I can't trigger the issue using the SW watchdog (modprobe softdog):

# watchdog -X 5 --foreground --verbose --config-file watchdog.conf 
watchdog: NOTE: Using --loop-exit so daemon will exit after 5 time intervals
watchdog: String 'watchdog-device' found as '/dev/watchdog'
watchdog: Integer 'watchdog-timeout' found = 10
watchdog: Integer 'interval' found = 5
watchdog: Integer 'verbose' found = 2
watchdog: starting daemon (5.16):
watchdog:  int=5s realtime=no sync=no load=0,0,0 soft=no
watchdog:  memory not checked
watchdog:  ping: no machine to check
watchdog:  file: no file to check
watchdog:  pidfile: no server process to check
watchdog:  interface: no interface to check
watchdog:  temperature: no sensors to check
watchdog:  no test binary files
watchdog:  no repair binary files
watchdog:  error retry time-out = 60 seconds
watchdog:  repair attempts = 1
watchdog:  alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
watchdog: Warning: watchdog now set to 10 seconds, should be more than double interval = 5
watchdog: hardware watchdog identity: Software Watchdog
watchdog: Time from last refresh = 0.000030952 => 0
watchdog: Time from last refresh = 0.000047378 => 0
watchdog: Time from last refresh = 0.000070687 => 0
watchdog: Time from last refresh = 0.000084551 => 0
watchdog: Time from last refresh = 0.000097637 => 0
watchdog: Time from last refresh = 0.000110945 => 0
watchdog: still alive after 1 interval(s)
watchdog: Time from last refresh = 5.000312826 => 1
watchdog: Time from last refresh = 0.000031203 => 0
watchdog: Time from last refresh = 0.000065960 => 0
watchdog: Time from last refresh = 0.000084680 => 0
watchdog: Time from last refresh = 0.000102911 => 0
watchdog: Time from last refresh = 0.000121451 => 0
watchdog: still alive after 2 interval(s)
watchdog: Time from last refresh = 5.000319324 => 1
watchdog: Time from last refresh = 0.000024510 => 0
watchdog: Time from last refresh = 0.000055251 => 0
watchdog: Time from last refresh = 0.000070385 => 0
watchdog: Time from last refresh = 0.000100537 => 0
watchdog: Time from last refresh = 0.000116801 => 0
watchdog: still alive after 3 interval(s)
watchdog: Time from last refresh = 5.000356395 => 1
watchdog: Time from last refresh = 0.000015749 => 0
watchdog: Time from last refresh = 0.000046203 => 0
watchdog: Time from last refresh = 0.000061662 => 0
watchdog: Time from last refresh = 0.000076055 => 0
watchdog: Time from last refresh = 0.000090371 => 0
watchdog: still alive after 4 interval(s)
watchdog: Time from last refresh = 5.000300085 => 1
watchdog: Time from last refresh = 0.000015585 => 0
watchdog: Time from last refresh = 0.000045864 => 0
watchdog: Time from last refresh = 0.000061623 => 0
watchdog: Time from last refresh = 0.000075198 => 0
watchdog: Time from last refresh = 0.000088652 => 0
watchdog: still alive after 5 interval(s)
watchdog: loop exit on interval counter reached
watchdog: stopping daemon (5.16)
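
(For reference, roughly how a softdog run like the one above can be set up, assuming nothing else, e.g. systemd's RuntimeWatchdogSec=, is holding /dev/watchdog open:)

modprobe -r iTCO_wdt   # unload the hardware watchdog driver
modprobe softdog       # /dev/watchdog is now backed by the software watchdog
wdctl /dev/watchdog    # identity should now read "Software Watchdog"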

I also tested the iTCO watchdog with 20s timeout & 10s interval and the result is the same (the machine stops responding).
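
(I.e. the same config file as above, only with the numbers bumped:)

watchdog-device = /dev/watchdog
watchdog-timeout = 20
interval = 10
verbose = 2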

I know the watchdog utility complains that the timeout should be more than double the interval, but the machine is not under any load, so the current timeout should be plenty (as the SW watchdog run shows). Also, the machine should come back after being rebooted by the watchdog, but instead it seems to get stuck in some state unknown to me.

I disabled the problematic test case for now, so it's no longer a blocking issue, but there's definitely something fishy going on.

@mrc0mmand happy to debug that with you when I'm back from PTO (still one week to go), so I'll reach out once I'm back and we can "reserve" one dusty node to see what's wrong

Ah, interesting, upon further investigation this looks like a bug in the iTCO_wdt kernel module:

https://bugzilla.kernel.org/show_bug.cgi?id=213809

which was fixed about a month ago by reverting the offending commits:

https://patchwork.kernel.org/project/linux-watchdog/patch/20211008003302.1461733-1-linux@roeck-us.net/

but I guess the patch is not in the C9S kernel.
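
(Once a fixed kernel lands in C9S, something like this should be enough to verify from a node — assuming the revert is mentioned in the kernel RPM changelog, which is a guess, and assuming the kernel-core split packaging used on C9S:)

uname -r
rpm -q --changelog "kernel-core-$(uname -r)" | grep -iE -B1 -A1 'itco|watchdog'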

@mrc0mmand ack ... I'd be tempted to close this ticket as it's now more of a software issue, but let's keep it open, and once you see the patch land in Stream you'll have a way to verify yourself that it works fine and then close the ticket :)

Finally got around to testing this again, and the current C9S snapshots indeed contain a fixed kernel, so the tests pass once again without killing the *.dusty machines.

Metadata Update from @mrc0mmand:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

