#865 Unstable-ish Duffy machines
Closed: Fixed 2 years ago by mrc0mmand. Opened 2 years ago by mrc0mmand.

Today I noticed that some of the Duffy nodes are a bit unstable - is this related to the "shuffling-around" caused by the pool migration? It doesn't matter which Duffy API I use:

2022-08-01 09:28:48,790 [agent-control/allocate_node] INFO: Attempting to allocate a node (version: 8-stream, arch: x86_64, flavor: n/a)
2022-08-01 09:28:48,790 [agent-control/_execute_api_command] INFO: Duffy request URL: http://admin.ci.centos.org:8080/Node/get
2022-08-01 09:28:49,885 [agent-control/allocate_node] INFO: Successfully allocated node 'n49.pufty' (121)
2022-08-01 09:28:49,886 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-01 09:28:49,886 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n49.pufty': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-01 09:28:49,886 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n49.pufty dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n49.pufty,172.19.3.113' (ECDSA) to the list of known hosts.
27 files removed
CentOS Stream 8 - AppStream                      51 MB/s |  24 MB     00:00    
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-01 09:28:53,851 [agent-control/main] ERROR: Execution failed
2022-08-01 16:51:34,367 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-01 16:51:35,388 [agent-control/allocate_node] INFO: Allocated node n19.gusty.ci.centos.org with session id 259
2022-08-01 16:51:35,388 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-01 16:51:35,388 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n19.gusty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-01 16:51:35,388 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n19.gusty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n19.gusty.ci.centos.org,172.19.2.147' (ECDSA) to the list of known hosts.
0 files removed
CentOS Stream 8 - AppStream                      62 MB/s |  24 MB     00:00    
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-01 16:51:39,443 [agent-control/main] ERROR: Execution failed
2022-08-01 16:51:34,205 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-01 16:51:35,375 [agent-control/allocate_node] INFO: Allocated node n19.gusty.ci.centos.org with session id 258
2022-08-01 16:51:35,375 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-01 16:51:35,375 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n19.gusty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-01 16:51:35,375 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n19.gusty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n19.gusty.ci.centos.org,172.19.2.147' (ECDSA) to the list of known hosts.
6 files removed
Waiting for process with pid 6620 to finish.
CentOS Stream 8 - AppStream                      62 MB/s |  24 MB     00:00    
Timeout, server n19.gusty.ci.centos.org not responding.

/cc @arrfab


Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, high-gain, investigation, medium-trouble

2 years ago

The gusty one is something I had an issue with yesterday, so I had to ask for a full chassis reset (again).
The other ones seem different, though: you have access to the node, but it's as if the dnf metadata on it were corrupted? (strange)
Do you still see that behavior today?

Unfortunately, yes:

2022-08-02 08:30:35,339 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 08:30:36,451 [agent-control/allocate_node] INFO: Allocated node n28.gusty.ci.centos.org with session id 403
2022-08-02 08:30:36,452 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 08:30:36,452 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n28.gusty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 08:30:36,452 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n28.gusty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n28.gusty.ci.centos.org,172.19.2.156' (ECDSA) to the list of known hosts.
Waiting for process with pid 11712 to finish.
6 files removed
CentOS Stream 8 - AppStream                      59 MB/s |  24 MB     00:00    
Timeout, server n28.gusty.ci.centos.org not responding.
2022-08-02 08:30:50,980 [agent-control/main] ERROR: Execution failed
2022-08-02 08:30:35,265 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 08:30:36,437 [agent-control/allocate_node] INFO: Allocated node n28.gusty.ci.centos.org with session id 402
2022-08-02 08:30:36,437 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 08:30:36,437 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n28.gusty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 08:30:36,437 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n28.gusty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n28.gusty.ci.centos.org,172.19.2.156' (ECDSA) to the list of known hosts.
27 files removed
CentOS Stream 8 - AppStream                      60 MB/s |  24 MB     00:00    
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-02 08:30:40,375 [agent-control/main] ERROR: Execution failed

Also, this has been happening from time to time since yesterday:

2022-08-02 08:26:31,035 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 08:26:32,193 [agent-control/allocate_node] INFO: Allocated node n23.pufty.ci.centos.org with session id 396
2022-08-02 08:26:32,193 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 08:26:32,193 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n23.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 08:26:32,193 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n23.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n23.pufty.ci.centos.org,172.19.3.87' (ECDSA) to the list of known hosts.
FATAL: command execution failed
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:221)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:784)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:172)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
    at hudson.remoting.Channel.close(Channel.java:1493)
    at hudson.remoting.Channel.close(Channel.java:1446)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:872)
    at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:113)
    at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:763)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from 10.129.2.54/10.129.2.54:35034' is disconnected.
    at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
    at com.sun.proxy.$Proxy113.isAlive(Unknown Source)
    at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1147)
    at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1139)
    at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:741)
    at hudson.model.Build$BuildExecution.build(Build.java:206)
    at hudson.model.Build$BuildExecution.doRun(Build.java:163)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
    at hudson.model.Run.execute(Run.java:1880)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:428)
FATAL: Unable to delete script file /tmp/jenkins5743870364005224433.sh
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:221)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:784)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:172)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
    at hudson.remoting.Channel.close(Channel.java:1493)
    at hudson.remoting.Channel.close(Channel.java:1446)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:872)
    at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:113)
    at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:763)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@49b0491b:JNLP4-connect connection from 10.129.2.54/10.129.2.54:35034": Remote call on JNLP4-connect connection from 10.129.2.54/10.129.2.54:35034 failed. The channel is closing down or has closed down
    at hudson.remoting.Channel.call(Channel.java:991)
    at hudson.FilePath.act(FilePath.java:1069)
    at hudson.FilePath.act(FilePath.java:1058)
    at hudson.FilePath.delete(FilePath.java:1543)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:123)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:741)
    at hudson.model.Build$BuildExecution.build(Build.java:206)
    at hudson.model.Build$BuildExecution.doRun(Build.java:163)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
    at hudson.model.Run.execute(Run.java:1880)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:428)
Build step 'Execute shell' marked build as failure

Not sure if it's related.

And it doesn't seem to be gusty-specific:

2022-08-02 11:12:12,626 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 11:12:14,003 [agent-control/allocate_node] INFO: Allocated node n53.pufty.ci.centos.org with session id 434
2022-08-02 11:12:14,003 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 11:12:14,003 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n53.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 11:12:14,003 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n53.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n53.pufty.ci.centos.org,172.19.3.117' (ECDSA) to the list of known hosts.
0 files removed
CentOS Stream 8 - AppStream                      54 MB/s |  24 MB     00:00    
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-02 11:12:18,075 [agent-control/main] ERROR: Execution failed
2022-08-02 11:12:12,885 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 11:12:14,593 [agent-control/allocate_node] INFO: Allocated node n10.pufty.ci.centos.org with session id 436
2022-08-02 11:12:14,593 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 11:12:14,593 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n10.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 11:12:14,593 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n10.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n10.pufty.ci.centos.org,172.19.3.74' (ECDSA) to the list of known hosts.
0 files removed
CentOS Stream 8 - AppStream                      53 MB/s |  24 MB     00:00    
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-02 11:12:18,704 [agent-control/main] ERROR: Execution failed
2022-08-02 11:12:12,528 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 11:12:13,985 [agent-control/allocate_node] INFO: Allocated node n53.pufty.ci.centos.org with session id 433
2022-08-02 11:12:13,986 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 11:12:13,986 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n53.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 11:12:13,986 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n53.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n53.pufty.ci.centos.org,172.19.3.117' (ECDSA) to the list of known hosts.
0 files removed
Waiting for process with pid 6896 to finish.
CentOS Stream 8 - AppStream                      51 MB/s |  24 MB     00:00    
Timeout, server n53.pufty.ci.centos.org not responding.
2022-08-02 11:12:32,723 [agent-control/main] ERROR: Execution failed
2022-08-02 11:12:12,473 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 11:12:13,961 [agent-control/allocate_node] INFO: Allocated node n53.pufty.ci.centos.org with session id 432
2022-08-02 11:12:13,961 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 11:12:13,961 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n53.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 11:12:13,961 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n53.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n53.pufty.ci.centos.org,172.19.3.117' (ECDSA) to the list of known hosts.
6 files removed
Waiting for process with pid 6896 to finish.
Waiting for process with pid 6903 to finish.
Timeout, server n53.pufty.ci.centos.org not responding.

Maybe the machine is still provisioning at the time we get access to it?

Oh well, the pufty chassis is giving up:

00:18:43.942     archlinux_systemd_ci: [1531/1710] Generating man/systemd.resource-control.html with a custom command
00:18:44.157     archlinux_systemd_ci: [1532/1710] Generating man/systemd.slice.html with a custom command
00:18:44.179     archlinux_systemd_ci: [1533/1710] Generating man/systemd.network.html with a custom command
00:18:44.288     archlinux_systemd_ci: [1534/1710] Generating man/systemd.scope.5 with a custom command
00:18:44.340     archlinux_systemd_ci: [1535/1710] Generating man/systemd.service.html with a custom command
00:18:44.368     archlinux_systemd_ci: [1536/1710] Generating man/systemd.resource-control.5 with a custom command
00:18:44.379     archlinux_systemd_ci: [1537/1710] Generating man/systemd.service.5 with a custom command
00:18:44.431     archlinux_systemd_ci: [1538/1710] Generating man/systemd.slice.5 with a custom command
00:18:52.439 Timeout, server n26.pufty.ci.centos.org not responding.
00:18:52.440 2022-08-02 11:33:18,255 [agent-control/fetch_artifacts] INFO: Fetching artifacts from node n26.pufty.ci.centos.org: (remote: ~/vagrant-logs*, local: ./artifacts_8i1r4m69)
00:18:52.440 2022-08-02 11:33:18,255 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/scp -r -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@n26.pufty.ci.centos.org:~/vagrant-logs* ./artifacts_8i1r4m69
00:19:02.633 ssh: connect to host n26.pufty.ci.centos.org port 22: No route to host
00:00:59.743 2022-08-02 11:59:19,811 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
00:01:00.751 2022-08-02 11:59:20,819 [agent-control/allocate_node] INFO: Allocated node n45.pufty.ci.centos.org with session id 464
00:01:00.751 2022-08-02 11:59:20,819 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
00:01:00.752 2022-08-02 11:59:20,819 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n45.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
00:01:00.752 2022-08-02 11:59:20,819 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n45.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
00:01:00.757 Pseudo-terminal will not be allocated because stdin is not a terminal.
00:01:00.789 Warning: Permanently added 'n45.pufty.ci.centos.org,172.19.3.109' (ECDSA) to the list of known hosts.
00:01:01.366 0 files removed
00:01:01.856 Waiting for process with pid 6579 to finish.
00:01:27.892 CentOS Stream 8 - AppStream                     949 kB/s | 4.4 kB     00:00    
00:01:28.139 CentOS Stream 8 - BaseOS                        778 kB/s | 3.9 kB     00:00    
00:01:28.408 CentOS Stream 8 - Extras                        659 kB/s | 2.9 kB     00:00    
00:01:28.414 CentOS Stream 8 - Extras common packages        682 kB/s | 3.0 kB     00:00    
00:01:32.343 Metadata cache created.
00:01:32.881 Waiting for process with pid 6611 to finish.
00:01:34.590 Last metadata expiration check: 0:00:06 ago on Tue 02 Aug 2022 12:59:48 PM BST.
00:01:36.888 Package bash-4.4.20-4.el8.x86_64 is already installed.
00:01:36.891 Package rsync-3.1.3-14.el8.2.x86_64 is already installed.
00:01:37.057 Dependencies resolved.
00:01:37.076 =========================================================================================
00:01:37.076  Package                  Arch    Version                                Repo        Size
00:01:37.076 =========================================================================================
00:01:37.076 Installing:
00:01:37.076  git                      x86_64  2.31.1-2.el8                           appstream  161 k
00:01:37.076 Installing dependencies:
00:01:37.076  git-core                 x86_64  2.31.1-2.el8                           appstream  4.7 M
00:01:37.076  git-core-doc             noarch  2.31.1-2.el8                           appstream  2.6 M
00:01:37.076  perl-Carp                noarch  1.42-396.el8                           baseos      30 k
00:01:37.076  perl-Data-Dumper         x86_64  2.167-399.el8                          baseos      58 k
00:01:37.076  perl-Digest              noarch  1.17-395.el8                           appstream   27 k
00:01:37.076  perl-Digest-MD5          x86_64  2.55-396.el8                           appstream   37 k
00:01:37.076  perl-Encode              x86_64  4:2.97-3.el8                           baseos     1.5 M
00:01:37.076  perl-Errno               x86_64  1.28-421.el8                           baseos      76 k
00:01:37.076  perl-Error               noarch  1:0.17025-2.el8                        appstream   46 k
00:01:37.076  perl-Exporter            noarch  5.72-396.el8                           baseos      34 k
00:01:37.076  perl-File-Path           noarch  2.15-2.el8                             baseos      38 k
00:01:37.076  perl-File-Temp           noarch  0.230.600-1.el8                        baseos      63 k
00:01:37.076  perl-Getopt-Long         noarch  1:2.50-4.el8                           baseos      63 k
00:01:37.076  perl-Git                 noarch  2.31.1-2.el8                           appstream   78 k
00:01:37.076  perl-HTTP-Tiny           noarch  0.074-1.el8                            baseos      58 k
00:01:37.076  perl-IO                  x86_64  1.38-421.el8                           baseos     142 k
00:01:37.076  perl-MIME-Base64         x86_64  3.15-396.el8                           baseos      31 k
00:01:37.076  perl-Net-SSLeay          x86_64  1.88-1.module_el8.4.0+517+be1595ff     appstream  379 k
00:01:37.076  perl-PathTools           x86_64  3.74-1.el8                             baseos      90 k
00:01:37.076  perl-Pod-Escapes         noarch  1:1.07-395.el8                         baseos      20 k
00:01:37.076  perl-Pod-Perldoc         noarch  3.28-396.el8                           baseos      86 k
00:01:37.076  perl-Pod-Simple          noarch  1:3.35-395.el8                         baseos     213 k
00:01:37.076  perl-Pod-Usage           noarch  4:1.69-395.el8                         baseos      34 k
00:01:37.076  perl-Scalar-List-Utils   x86_64  3:1.49-2.el8                           baseos      68 k
00:01:37.076  perl-Socket              x86_64  4:2.027-3.el8                          baseos      59 k
00:01:37.076  perl-Storable            x86_64  1:3.11-3.el8                           baseos      98 k
00:01:37.076  perl-Term-ANSIColor      noarch  4.06-396.el8                           baseos      46 k
00:01:37.076  perl-Term-Cap            noarch  1.17-395.el8                           baseos      23 k
00:01:37.076  perl-TermReadKey         x86_64  2.37-7.el8                             appstream   40 k
00:01:37.076  perl-Text-ParseWords     noarch  3.30-395.el8                           baseos      18 k
00:01:37.076  perl-Text-Tabs+Wrap      noarch  2013.0523-395.el8                      baseos      24 k
00:01:37.076  perl-Time-Local          noarch  1:1.280-1.el8                          baseos      34 k
00:01:37.076  perl-URI                 noarch  1.73-3.el8                             appstream  116 k
00:01:37.076  perl-Unicode-Normalize   x86_64  1.25-396.el8                           baseos      82 k
00:01:37.076  perl-constant            noarch  1.33-396.el8                           baseos      25 k
00:01:37.076  perl-interpreter         x86_64  4:5.26.3-421.el8                       baseos     6.3 M
00:01:37.076  perl-libnet              noarch  3.11-3.el8                             appstream  121 k
00:01:37.076  perl-libs                x86_64  4:5.26.3-421.el8                       baseos     1.6 M
00:01:37.076  perl-macros              x86_64  4:5.26.3-421.el8                       baseos      72 k
00:01:37.076  perl-parent              noarch  1:0.237-1.el8                          baseos      20 k
00:01:37.076  perl-podlators           noarch  4.11-1.el8                             baseos     118 k
00:01:37.076  perl-threads             x86_64  1:2.21-2.el8                           baseos      61 k
00:01:37.076  perl-threads-shared      x86_64  1.58-2.el8                             baseos      48 k
00:01:37.076 Installing weak dependencies:
00:01:37.076  perl-IO-Socket-IP        noarch  0.39-5.el8                             appstream   47 k
00:01:37.076  perl-IO-Socket-SSL       noarch  2.066-4.module_el8.4.0+517+be1595ff    appstream  298 k
00:01:37.076  perl-Mozilla-CA          noarch  20160104-7.module_el8.3.0+416+dee7bcef appstream   15 k
00:01:37.076 Enabling module streams:
00:01:37.076  perl                             5.26                                                   
00:01:37.076  perl-IO-Socket-SSL               2.066                                                  
00:01:37.076  perl-libwww-perl                 6.34                                                   
00:01:37.076 
00:01:37.076 Transaction Summary
00:01:37.076 =========================================================================================
00:01:37.076 Install  47 Packages
00:01:37.076 
00:01:37.082 Total size: 20 M
00:01:37.082 Installed size: 73 M
00:01:37.084 Downloading Packages:
00:01:37.094 [SKIPPED] git-2.31.1-2.el8.x86_64.rpm: Already downloaded                      
00:01:37.094 [SKIPPED] git-core-2.31.1-2.el8.x86_64.rpm: Already downloaded                 
00:01:37.094 [SKIPPED] git-core-doc-2.31.1-2.el8.noarch.rpm: Already downloaded             
00:01:37.094 [SKIPPED] perl-Digest-1.17-395.el8.noarch.rpm: Already downloaded              
00:01:37.094 [SKIPPED] perl-Digest-MD5-2.55-396.el8.x86_64.rpm: Already downloaded          
00:01:37.094 [SKIPPED] perl-Error-0.17025-2.el8.noarch.rpm: Already downloaded              
00:01:37.094 [SKIPPED] perl-Git-2.31.1-2.el8.noarch.rpm: Already downloaded                 
00:01:37.094 [SKIPPED] perl-IO-Socket-IP-0.39-5.el8.noarch.rpm: Already downloaded          
00:01:37.094 [SKIPPED] perl-IO-Socket-SSL-2.066-4.module_el8.4.0+517+be1595ff.noarch.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-Mozilla-CA-20160104-7.module_el8.3.0+416+dee7bcef.noarch.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-Net-SSLeay-1.88-1.module_el8.4.0+517+be1595ff.x86_64.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-TermReadKey-2.37-7.el8.x86_64.rpm: Already downloaded           
00:01:37.095 [SKIPPED] perl-URI-1.73-3.el8.noarch.rpm: Already downloaded                   
00:01:37.095 [SKIPPED] perl-libnet-3.11-3.el8.noarch.rpm: Already downloaded                
00:01:37.095 [SKIPPED] perl-Carp-1.42-396.el8.noarch.rpm: Already downloaded                
00:01:37.095 [SKIPPED] perl-Data-Dumper-2.167-399.el8.x86_64.rpm: Already downloaded        
00:01:37.095 [SKIPPED] perl-Encode-2.97-3.el8.x86_64.rpm: Already downloaded                
00:01:37.096 [SKIPPED] perl-Errno-1.28-421.el8.x86_64.rpm: Already downloaded               
00:01:37.096 [SKIPPED] perl-Exporter-5.72-396.el8.noarch.rpm: Already downloaded            
00:01:37.096 [SKIPPED] perl-File-Path-2.15-2.el8.noarch.rpm: Already downloaded             
00:01:37.096 [SKIPPED] perl-File-Temp-0.230.600-1.el8.noarch.rpm: Already downloaded        
00:01:37.096 [SKIPPED] perl-Getopt-Long-2.50-4.el8.noarch.rpm: Already downloaded           
00:01:37.096 [SKIPPED] perl-HTTP-Tiny-0.074-1.el8.noarch.rpm: Already downloaded            
00:01:37.096 [SKIPPED] perl-IO-1.38-421.el8.x86_64.rpm: Already downloaded                  
00:01:37.096 [SKIPPED] perl-MIME-Base64-3.15-396.el8.x86_64.rpm: Already downloaded         
00:01:37.096 [SKIPPED] perl-PathTools-3.74-1.el8.x86_64.rpm: Already downloaded             
00:01:37.097 [SKIPPED] perl-Pod-Escapes-1.07-395.el8.noarch.rpm: Already downloaded         
00:01:37.097 [SKIPPED] perl-Pod-Perldoc-3.28-396.el8.noarch.rpm: Already downloaded         
00:01:37.097 [SKIPPED] perl-Pod-Simple-3.35-395.el8.noarch.rpm: Already downloaded          
00:01:37.097 [SKIPPED] perl-Pod-Usage-1.69-395.el8.noarch.rpm: Already downloaded           
00:01:37.097 [SKIPPED] perl-Scalar-List-Utils-1.49-2.el8.x86_64.rpm: Already downloaded     
00:01:37.097 [SKIPPED] perl-Socket-2.027-3.el8.x86_64.rpm: Already downloaded               
00:01:37.097 [SKIPPED] perl-Storable-3.11-3.el8.x86_64.rpm: Already downloaded              
00:01:37.097 [SKIPPED] perl-Term-ANSIColor-4.06-396.el8.noarch.rpm: Already downloaded      
00:01:37.097 [SKIPPED] perl-Term-Cap-1.17-395.el8.noarch.rpm: Already downloaded            
00:01:37.098 [SKIPPED] perl-Text-ParseWords-3.30-395.el8.noarch.rpm: Already downloaded     
00:01:37.098 [SKIPPED] perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch.rpm: Already downloaded 
00:01:37.098 [SKIPPED] perl-Time-Local-1.280-1.el8.noarch.rpm: Already downloaded           
00:01:37.098 [SKIPPED] perl-Unicode-Normalize-1.25-396.el8.x86_64.rpm: Already downloaded   
00:01:37.098 [SKIPPED] perl-constant-1.33-396.el8.noarch.rpm: Already downloaded            
00:01:37.099 [SKIPPED] perl-interpreter-5.26.3-421.el8.x86_64.rpm: Already downloaded       
00:01:37.099 [SKIPPED] perl-libs-5.26.3-421.el8.x86_64.rpm: Already downloaded              
00:01:37.099 [SKIPPED] perl-macros-5.26.3-421.el8.x86_64.rpm: Already downloaded            
00:01:37.099 [SKIPPED] perl-parent-0.237-1.el8.noarch.rpm: Already downloaded               
00:01:37.099 [SKIPPED] perl-podlators-4.11-1.el8.noarch.rpm: Already downloaded             
00:01:37.099 [SKIPPED] perl-threads-2.21-2.el8.x86_64.rpm: Already downloaded               
00:01:37.099 [SKIPPED] perl-threads-shared-1.58-2.el8.x86_64.rpm: Already downloaded        
00:01:38.168 Running transaction check
00:01:38.168 Waiting for process with pid 6611 to finish.
00:01:48.251 Transaction check succeeded.
00:01:48.251 Running transaction test
00:01:48.521 FATAL ERROR: python callback ??? failed, aborting!
00:01:50.132 2022-08-02 12:00:09,202 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n45.pufty.ci.centos.org systemd-centos-ci/vagrant/vagrant-ci-wrapper.sh -d 'arch-sanitizers-clang' -r 'pr:24178' 
00:01:50.138 Pseudo-terminal will not be allocated because stdin is not a terminal.
00:01:50.174 Warning: Permanently added 'n45.pufty.ci.centos.org,172.19.3.109' (ECDSA) to the list of known hosts.
00:01:50.579 [TASK-CONTROL] OPTIMAL_QEMU_SMP = 4
00:01:50.579 [TASK-CONTROL] MAX_QUEUE_SIZE = 2
00:01:50.586 Cloning repo: https://github.com/systemd/systemd.git
00:01:50.589 Cloning into 'systemd'...
00:02:12.612 Timeout, server n45.pufty.ci.centos.org not responding.
00:02:12.613 2022-08-02 12:00:31,683 [agent-control/fetch_artifacts] INFO: Fetching artifacts from node n45.pufty.ci.centos.org: (remote: ~/vagrant-logs*, local: ./artifacts_3vttpv74)
00:02:12.613 2022-08-02 12:00:31,683 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/scp -r -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@n45.pufty.ci.centos.org:~/vagrant-logs* ./artifacts_3vttpv74
00:03:19.942 ssh: connect to host n45.pufty.ci.centos.org port 22: No route to host

Nope, it's more that the node is somehow still listed as available for another tenant; I'm trying to get @nphilipp's help to understand what's going on.

And just to confirm - this is not exclusive to the seamicro chassis, it happens on the EC2 metal machines as well:

2022-08-03 11:07:15,273 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-ec2-c5n-centos-8s-x86_64
2022-08-03 11:07:16,217 [agent-control/allocate_node] INFO: Allocated node n27-31-90.pool.ci.centos.org with session id 817
2022-08-03 11:07:16,217 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-03 11:07:16,218 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-31-90.pool.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-03 11:07:16,218 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n27-31-90.pool.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n27-31-90.pool.ci.centos.org,172.27.31.90' (ECDSA) to the list of known hosts.
27 files removed
CentOS Stream 8 - AppStream                      27 MB/s |  24 MB     00:00    
CentOS Stream 8 - BaseOS                         92 MB/s |  25 MB     00:00    
CentOS Stream 8 - Extras                         89 kB/s |  18 kB     00:00    
CentOS Stream 8 - Extras common packages        9.1 kB/s | 4.6 kB     00:00    
Metadata cache created.
Last metadata expiration check: 0:00:02 ago on Wed 03 Aug 2022 11:07:27 AM UTC.
Package bash-4.4.20-4.el8.x86_64 is already installed.
Package rsync-3.1.3-14.el8.2.x86_64 is already installed.
Dependencies resolved.
=========================================================================================
 Package                  Arch    Version                                Repo        Size
=========================================================================================
Installing:
 git                      x86_64  2.31.1-2.el8                           appstream  161 k
Installing dependencies:
 emacs-filesystem         noarch  1:26.1-7.el8                           baseos      70 k
 git-core                 x86_64  2.31.1-2.el8                           appstream  4.7 M
 git-core-doc             noarch  2.31.1-2.el8                           appstream  2.6 M
 perl-Carp                noarch  1.42-396.el8                           baseos      30 k
 perl-Data-Dumper         x86_64  2.167-399.el8                          baseos      58 k
 perl-Digest              noarch  1.17-395.el8                           appstream   27 k
 perl-Digest-MD5          x86_64  2.55-396.el8                           appstream   37 k
 perl-Encode              x86_64  4:2.97-3.el8                           baseos     1.5 M
 perl-Errno               x86_64  1.28-421.el8                           baseos      76 k
 perl-Error               noarch  1:0.17025-2.el8                        appstream   46 k
 perl-Exporter            noarch  5.72-396.el8                           baseos      34 k
 perl-File-Path           noarch  2.15-2.el8                             baseos      38 k
 perl-File-Temp           noarch  0.230.600-1.el8                        baseos      63 k
 perl-Getopt-Long         noarch  1:2.50-4.el8                           baseos      63 k
 perl-Git                 noarch  2.31.1-2.el8                           appstream   78 k
 perl-HTTP-Tiny           noarch  0.074-1.el8                            baseos      58 k
 perl-IO                  x86_64  1.38-421.el8                           baseos     142 k
 perl-MIME-Base64         x86_64  3.15-396.el8                           baseos      31 k
 perl-Net-SSLeay          x86_64  1.88-1.module_el8.4.0+517+be1595ff     appstream  379 k
 perl-PathTools           x86_64  3.74-1.el8                             baseos      90 k
 perl-Pod-Escapes         noarch  1:1.07-395.el8                         baseos      20 k
 perl-Pod-Perldoc         noarch  3.28-396.el8                           baseos      86 k
 perl-Pod-Simple          noarch  1:3.35-395.el8                         baseos     213 k
 perl-Pod-Usage           noarch  4:1.69-395.el8                         baseos      34 k
 perl-Scalar-List-Utils   x86_64  3:1.49-2.el8                           baseos      68 k
 perl-Socket              x86_64  4:2.027-3.el8                          baseos      59 k
 perl-Storable            x86_64  1:3.11-3.el8                           baseos      98 k
 perl-Term-ANSIColor      noarch  4.06-396.el8                           baseos      46 k
 perl-Term-Cap            noarch  1.17-395.el8                           baseos      23 k
 perl-TermReadKey         x86_64  2.37-7.el8                             appstream   40 k
 perl-Text-ParseWords     noarch  3.30-395.el8                           baseos      18 k
 perl-Text-Tabs+Wrap      noarch  2013.0523-395.el8                      baseos      24 k
 perl-Time-Local          noarch  1:1.280-1.el8                          baseos      34 k
 perl-URI                 noarch  1.73-3.el8                             appstream  116 k
 perl-Unicode-Normalize   x86_64  1.25-396.el8                           baseos      82 k
 perl-constant            noarch  1.33-396.el8                           baseos      25 k
 perl-interpreter         x86_64  4:5.26.3-421.el8                       baseos     6.3 M
 perl-libnet              noarch  3.11-3.el8                             appstream  121 k
 perl-libs                x86_64  4:5.26.3-421.el8                       baseos     1.6 M
 perl-macros              x86_64  4:5.26.3-421.el8                       baseos      72 k
 perl-parent              noarch  1:0.237-1.el8                          baseos      20 k
 perl-podlators           noarch  4.11-1.el8                             baseos     118 k
 perl-threads             x86_64  1:2.21-2.el8                           baseos      61 k
 perl-threads-shared      x86_64  1.58-2.el8                             baseos      48 k
Installing weak dependencies:
 perl-IO-Socket-IP        noarch  0.39-5.el8                             appstream   47 k
 perl-IO-Socket-SSL       noarch  2.066-4.module_el8.4.0+517+be1595ff    appstream  298 k
 perl-Mozilla-CA          noarch  20160104-7.module_el8.3.0+416+dee7bcef appstream   15 k
Enabling module streams:
 perl                             5.26                                                   
 perl-IO-Socket-SSL               2.066                                                  
 perl-libwww-perl                 6.34                                                   

Transaction Summary
=========================================================================================
Install  48 Packages

Total download size: 20 M
Installed size: 73 M
Downloading Packages:
(1/48): git-2.31.1-2.el8.x86_64.rpm             6.7 MB/s | 161 kB     00:00    
(2/48): perl-Digest-1.17-395.el8.noarch.rpm     7.1 MB/s |  27 kB     00:00    
(3/48): perl-Digest-MD5-2.55-396.el8.x86_64.rpm 7.0 MB/s |  37 kB     00:00    
(4/48): perl-Error-0.17025-2.el8.noarch.rpm      13 MB/s |  46 kB     00:00    
(5/48): perl-Git-2.31.1-2.el8.noarch.rpm         20 MB/s |  78 kB     00:00    
(6/48): perl-IO-Socket-IP-0.39-5.el8.noarch.rpm 9.4 MB/s |  47 kB     00:00    
(7/48): git-core-doc-2.31.1-2.el8.noarch.rpm     11 MB/s | 2.6 MB     00:00    
(8/48): perl-IO-Socket-SSL-2.066-4.module_el8.4 901 kB/s | 298 kB     00:00    
(9/48): perl-Mozilla-CA-20160104-7.module_el8.3 118 kB/s |  15 kB     00:00    
(10/48): perl-TermReadKey-2.37-7.el8.x86_64.rpm 9.2 MB/s |  40 kB     00:00    
(11/48): perl-Net-SSLeay-1.88-1.module_el8.4.0+ 5.5 MB/s | 379 kB     00:00    
(12/48): perl-URI-1.73-3.el8.noarch.rpm         1.7 MB/s | 116 kB     00:00    
(13/48): perl-libnet-3.11-3.el8.noarch.rpm      1.6 MB/s | 121 kB     00:00    
(14/48): emacs-filesystem-26.1-7.el8.noarch.rpm 973 kB/s |  70 kB     00:00    
Timeout, server n27-31-90.pool.ci.centos.org not responding.

Metadata Update from @zlopez:
- Issue priority set to: 🔥 Urgent 🔥 (was: Needs Review)

2 years ago

Metadata Update from @arrfab:
- Issue assigned to nphilipp

2 years ago

Quick update before the weekend: to sort out the different requests stepping on each other's toes, I'm working on changing Duffy to retry the failing query a couple of times (before giving up), so allocating nodes gets effectively serialized.

Thanks for the update! I've done the same on my side this morning (i.e. retrying the request if duffy.client raises httpx.HTTPStatusError) and so far it seems to be working pretty well (no more unexpected job failures). So this appears to be the last remaining issue, at least from the user's point of view, regarding node/API stability.
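
For reference, the retry can be a thin wrapper around whatever duffy.client call requests the session. A minimal sketch, assuming only that the allocation call raises httpx.HTTPStatusError on transient API failures (the request_fn callable and the retry parameters are illustrative, not the actual agent-control code):

import time

import httpx

def allocate_with_retry(request_fn, max_tries=10, delay=5):
    # Call request_fn() and retry on transient Duffy API errors; any other
    # exception still propagates immediately.
    for attempt in range(1, max_tries + 1):
        try:
            return request_fn()
        except httpx.HTTPStatusError as exc:
            if attempt == max_tries:
                raise
            print(f"[Try {attempt}/{max_tries}] Duffy API error: {exc}, retrying in {delay}s")
            time.sleep(delay)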

👍 This is the workaround I'd suggest to anyone running into this problem until we have a fix deployed.

Yesterday I moved 2/3 of our jobs over to the EC2 metal machines, and since then I've been seeing frequent intermittent issues that cause the jobs to randomly fail:

2022-08-09 22:43:33,992 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-ec2-c5n-centos-8s-x86_64
2022-08-09 22:44:17,880 [agent-control/allocate_node] INFO: Allocated node n27-28-26.pool.ci.centos.org with session id 3406
2022-08-09 22:44:17,880 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-09 22:44:17,880 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-28-26.pool.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-09 22:44:17,880 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n27-28-26.pool.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n27-28-26.pool.ci.centos.org,172.27.28.26' (ECDSA) to the list of known hosts.
0 files removed
Waiting for process with pid 2763 to finish.
CentOS Stream 8 - AppStream                      98 kB/s | 4.4 kB     00:00    
CentOS Stream 8 - BaseOS                         21 kB/s | 3.9 kB     00:00    
CentOS Stream 8 - Extras                         38 kB/s | 2.9 kB     00:00    
Metadata cache created.
Last metadata expiration check: 0:00:02 ago on Tue 09 Aug 2022 10:44:33 PM UTC.
Package bash-4.4.20-1.el8_4.x86_64 is already installed.
Package rsync-3.1.3-12.el8.x86_64 is already installed.
Dependencies resolved.
=========================================================================================
 Package                  Arch    Version                                Repo        Size
=========================================================================================
Installing:
 git                      x86_64  2.31.1-2.el8                           appstream  161 k
Upgrading:
 bash                     x86_64  4.4.20-4.el8                           baseos     1.5 M
 rsync                    x86_64  3.1.3-16.el8                           baseos     406 k

<...snip...>

(47/50): perl-threads-2.21-2.el8.x86_64.rpm     8.3 MB/s |  61 kB     00:00    
(48/50): perl-threads-shared-1.58-2.el8.x86_64. 5.6 MB/s |  48 kB     00:00    
(49/50): perl-libs-5.26.3-421.el8.x86_64.rpm     22 MB/s | 1.6 MB     00:00    
(50/50): perl-interpreter-5.26.3-421.el8.x86_64  22 MB/s | 6.3 MB     00:00    
--------------------------------------------------------------------------------
Total                                            22 MB/s |  20 MB     00:00     
Running transaction check
Waiting for process with pid 2763 to finish.
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Traceback (most recent call last):
  File "/usr/bin/dnf", line 58, in <module>
    main.user_main(sys.argv[1:], exit_code=True)
  File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 201, in user_main
    errcode = main(args)
  File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 67, in main
    return _main(base, args, cli_class, option_parser_class)
  File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 106, in _main
    return cli_run(cli, base)
  File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 130, in cli_run
    ret = resolving(cli, base)
  File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 176, in resolving
    base.do_transaction(display=displays)
  File "/usr/lib/python3.6/site-packages/dnf/cli/cli.py", line 245, in do_transaction
    display = [output.CliTransactionDisplay()] + list(display)
  File "/usr/lib/python3.6/site-packages/dnf/base.py", line 891, in do_transaction
    timer()
  File "/usr/lib/python3.6/site-packages/dnf/db/group.py", line 337, in _populate_rpm_ts
    ts.addErase(tsi.pkg.idx)
  File "/usr/lib64/python3.6/site-packages/rpm/transaction.py", line 109, in addErase
    raise rpm.error("package not installed")
_rpm.error: package not installed
2022-08-09 22:49:17,836 [agent-control/main] ERROR: Execution failed
2022-08-09 23:04:08,594 [agent-control/allocate_node] ERROR: [Try 168/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
2022-08-09 23:05:12,187 [agent-control/allocate_node] ERROR: [Try 180/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
2022-08-09 23:06:15,838 [agent-control/allocate_node] ERROR: [Try 192/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
2022-08-09 23:07:20,665 [agent-control/allocate_node] INFO: Allocated node n27-27-1.pool.ci.centos.org with session id 3681
2022-08-09 23:07:20,666 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-09 23:07:20,666 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-27-1.pool.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-09 23:07:20,666 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n27-27-1.pool.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n27-27-1.pool.ci.centos.org,172.27.27.1' (ECDSA) to the list of known hosts.
0 files removed
Waiting for process with pid 2625 to finish.
CentOS Stream 8 - AppStream                      45 kB/s | 4.4 kB     00:00    
CentOS Stream 8 - BaseOS                         15 kB/s | 3.9 kB     00:00    
CentOS Stream 8 - Extras                         16 kB/s | 2.9 kB     00:00    
Metadata cache created.
Last metadata expiration check: 0:00:02 ago on Tue 09 Aug 2022 11:07:38 PM UTC.
Package bash-4.4.20-1.el8_4.x86_64 is already installed.
Package rsync-3.1.3-12.el8.x86_64 is already installed.
Dependencies resolved.
=========================================================================================
 Package                  Arch    Version                                Repo        Size
=========================================================================================
Installing:
 git                      x86_64  2.31.1-2.el8                           appstream  161 k
Upgrading:
 bash                     x86_64  4.4.20-4.el8                           baseos     1.5 M

<...snip...>

(49/50): perl-threads-shared-1.58-2.el8.x86_64. 4.1 MB/s |  48 kB     00:00    
(50/50): perl-interpreter-5.26.3-421.el8.x86_64  50 MB/s | 6.3 MB     00:00    
--------------------------------------------------------------------------------
Total                                            13 MB/s |  20 MB     00:01     
Running transaction check
Waiting for process with pid 2625 to finish.
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
[Errno 2] No such file or directory: '/var/cache/dnf/baseos-5d5072979dfa6055/packages/bash-4.4.20-4.el8.x86_64.rpm'

The dnf errors are similar to the ones we saw when more than one session shared a machine and they stepped on each other's toes, but in this case I can't find any nodes belonging to multiple sessions on my side.

What I can see is one session stuck in "limbo":

(duffy) [systemd@slave01 ~]$ duffy client --format=flat list-sessions | sort  
session_id=3864 active=TRUE created_at='2022-08-10 05:46:01.445507+00:00' retired_at= pool= hostname='n27-44-255.pool.ci.centos.org' ipaddr='172.27.44.255'

(The pool field is missing and the machine doesn't belong to any job.)
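
To spot these quickly, something like this should work (just a sketch, relying on the flat field layout shown above, where an empty pool shows up as a bare pool= field):

```
# List active sessions whose node has no pool assigned ("limbo" sessions).
duffy client --format=flat list-sessions | grep 'active=TRUE' | grep ' pool= '
```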

The frequent API errors are also a bit suspicious:

2022-08-10 07:49:24,844 [agent-control/allocate_node] ERROR: [Try 12/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"

but that might be expected? I don't know how big the EC2 metal Duffy pool is.

FTR, here are all sessions/machines which failed with one of the above errors:

2022-08-09 22:44:17,880 [agent-control/allocate_node] INFO: Allocated node n27-28-26.pool.ci.centos.org with session id 3406
2022-08-09 23:07:20,665 [agent-control/allocate_node] INFO: Allocated node n27-27-1.pool.ci.centos.org with session id 3681
2022-08-09 22:47:20,431 [agent-control/allocate_node] INFO: Allocated node n27-24-218.pool.ci.centos.org with session id 3441
2022-08-09 22:52:18,472 [agent-control/allocate_node] INFO: Allocated node n27-12-14.pool.ci.centos.org with session id 3510
2022-08-10 00:51:48,245 [agent-control/allocate_node] INFO: Allocated node n27-46-161.pool.ci.centos.org with session id 3770
2022-08-10 05:29:01,853 [agent-control/allocate_node] INFO: Allocated node n27-2-195.pool.ci.centos.org with session id 3863
2022-08-10 07:49:57,979 [agent-control/allocate_node] INFO: Allocated node n27-12-0.pool.ci.centos.org with session id 3980
2022-08-10 07:37:08,962 [agent-control/allocate_node] INFO: Allocated node n27-8-206.pool.ci.centos.org with session id 3958
2022-08-10 07:54:41,722 [agent-control/allocate_node] INFO: Allocated node n27-29-29.pool.ci.centos.org with session id 4011
2022-08-10 08:01:39,165 [agent-control/allocate_node] INFO: Allocated node n27-12-134.pool.ci.centos.org with session id 4020
2022-08-10 08:49:47,876 [agent-control/allocate_node] INFO: Allocated node n27-15-78.pool.ci.centos.org with session id 4218

I wonder if our workload is racing against the cloud-init stuff that runs right after the machine is provisioned, but that should, hopefully, be guarded by the dnf locks...
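
One way to check that hypothesis by hand would be something like this on a freshly allocated node (a sketch; the node name is just the one from the log above):

```
# See whether cloud-init is still busy, and whether another dnf process is
# around to grab the lock while our job runs.
ssh root@n27-27-1.pool.ci.centos.org 'cloud-init status; ps -C dnf -o pid,etime,args'
```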

The machines aren't provisioned when you request them but ahead of time, so Duffy can hand them out quickly; only the tenant SSH keys are installed when they're allocated to your session.

No idea what's happening there. @arrfab @phsmoura, any idea?

For EC2 bare-metal, it's really expensive, so we only provision one node at a time:

duffy client show-pool metal-ec2-c5n-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 1,
    "levels": {
      "provisioning": 0,
      "ready": 1,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

It's true that the first thing cloud-init does is a full update, so if you request a node while it's still updating in the background, that could explain this - but it would also mean that dnf doesn't work as it should...
We could disable that cloud-init dnf update, but that would mean sometimes handing over a (very) old instance (images aren't refreshed on a regular basis).

From our Ansible setup:

```
#cloud-config
disable_root: False
runcmd:
  - hostnamectl set-hostname $(hostname|sed 's/ip-172-/n/')
  - dnf update -y
```

Is there an easy (reliable) way for tenants to check whether cloud-init has finished?

It looks like our job downloads the packages before cloud-init's update starts; then cloud-init's update takes the dnf lock, performs the upgrade (which includes the packages our job just downloaded), cleans up the downloaded packages, and releases the lock - and our job fails, since the packages it just downloaded are no longer there. That's unfortunate (and should be handled by dnf, imo).

Disabling the cloud-init update would make sense (we do the system update ourselves anyway), but I'm not sure whether it would cause issues for other projects.

Theoretically, to check whether cloud-init has finished, something like

systemd-run --wait -p Wants=cloud-init.target -p After=cloud-init.target true

should do the trick (maybe there's a nicer solution). I'll try this first, so we can keep cloud-init intact.
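
(For the record, cloud-init also ships a built-in blocking check, which might be the nicer solution if the version on the images is recent enough:)

```
# Blocks until cloud-init has finished all stages and reports the result.
cloud-init status --wait
```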

So far it indeed seems to be doing the right thing, so I'll push it to production and let it bake for a couple of days.

Oh well, never mind. That seemed to work for the first hour, but then most jobs either started timing out:

  perl-parent-1:0.237-1.el8.noarch                                              
  perl-podlators-4.11-1.el8.noarch                                              
  perl-threads-1:2.21-2.el8.x86_64                                              
  perl-threads-shared-1.58-2.el8.x86_64                                         

Complete!
Cloning into 'systemd-centos-ci'...
2022-08-11 18:23:24,842 [agent-control/main] INFO: PHASE 2: Run tests in Vagrant VMs
2022-08-11 18:23:24,843 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-20-126.pool.ci.centos.org': systemd-centos-ci/vagrant/vagrant-ci-wrapper.sh -d 'arch' -r 'pr:24263' 
2022-08-11 18:23:24,843 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n27-20-126.pool.ci.centos.org systemd-centos-ci/vagrant/vagrant-ci-wrapper.sh -d 'arch' -r 'pr:24263' 
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n27-20-126.pool.ci.centos.org,172.27.20.126' (ECDSA) to the list of known hosts.
Timeout, server n27-20-126.pool.ci.centos.org not responding.

or getting timeouts from the API:

2022-08-11 18:43:10,160 [agent-control/allocate_node] ERROR: [Try 48/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
2022-08-11 18:43:36,053 [agent-control/main] ERROR: Execution failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/usr/local/lib/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/usr/lib64/python3.8/ssl.py", line 1226, in recv
    return self.read(buflen)
  File "/usr/lib64/python3.8/ssl.py", line 1101, in read
    return self._sslobj.read(len)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
    raise exc
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
    response = connection.handle_request(request)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/connection.py", line 90, in handle_request
    return self._connection.handle_request(request)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/http11.py", line 105, in handle_request
    raise exc
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/http11.py", line 84, in handle_request
    ) = self._receive_response_headers(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/http11.py", line 148, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/http11.py", line 177, in _receive_event
    data = self._network_stream.read(
  File "/usr/local/lib/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc)
httpcore.ReadTimeout: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./agent-control.py", line 330, in main
    ac.allocate_node(args.pool)
  File "./agent-control.py", line 69, in allocate_node
    result = self._client.request_session([payload])
  File "/usr/local/lib/python3.8/site-packages/duffy/client/main.py", line 123, in request_session
    return self._query_method(
  File "/usr/local/lib/python3.8/site-packages/duffy/client/main.py", line 94, in _query_method
    response = client_method(url=url, **add_kwargs)
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 1130, in post
    return self.request(
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 815, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 902, in send
    response = self._send_handling_auth(
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 930, in _send_handling_auth
    response = self._send_handling_redirects(
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 967, in _send_handling_redirects
    response = self._send_single_request(request)
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 1003, in _send_single_request
    response = transport.handle_request(request)
  File "/usr/local/lib/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out

After another round of debugging, it turns out this particular issue is caused by systemd (oh no). Long story short: since the AMIs are quite outdated, the system update performed by cloud-init bumps the systemd package several versions ahead, and one of those versions introduces a backward-incompatible change in systemd user units. In combination with the cloud-init workaround above, this causes an error when the ssh session ends, which triggers a session restart just as we reconnect over ssh for the second time, killing the ssh connection and causing the test to fail:

[  379.336057] n27-16-16.pool.ci.centos.org sshd[5604]: Disconnected from user root 172.19.0.16 port 58460
[  379.338082] n27-16-16.pool.ci.centos.org sshd[5463]: pam_unix(sshd:session): session closed for user root
[  379.340102] n27-16-16.pool.ci.centos.org systemd-logind[1923]: Session 5 logged out. Waiting for processes to exit.
[  379.340327] n27-16-16.pool.ci.centos.org systemd[1]: session-5.scope: Succeeded.
[  379.341208] n27-16-16.pool.ci.centos.org systemd-logind[1923]: Removed session 5.
[  379.342297] n27-16-16.pool.ci.centos.org systemd[1]: Stopping User Manager for UID 0...
[  379.343138] n27-16-16.pool.ci.centos.org systemd[5562]: /usr/lib/systemd/user/systemd-exit.service:16: Failed to parse failure action specifier, ignoring: exit-force
[  379.343488] n27-16-16.pool.ci.centos.org systemd[5562]: systemd-exit.service: Service lacks both ExecStart= and ExecStop= setting. Refusing.
[  379.343525] n27-16-16.pool.ci.centos.org systemd[5562]: Failed to enqueue exit.target job: Unit systemd-exit.service has a bad unit file setting.
[  379.941173] n27-16-16.pool.ci.centos.org sshd[63794]: Accepted publickey for root from 172.19.0.16 port 58582 ssh2: RSA SHA256:XI6P95wU3TyZqN8YTWLscn48i9t+dMg1xgN9BgfCmDo
[  379.947015] n27-16-16.pool.ci.centos.org systemd-logind[1923]: New session 7 of user root.
[  379.947295] n27-16-16.pool.ci.centos.org sshd[63794]: pam_systemd(sshd:session): Failed to create session: Start job for unit user-0.slice failed with 'canceled'

This is a long-standing issue (it's still marked as FIXME even in the Fedora systemd specfile: https://src.fedoraproject.org/rpms/systemd/blob/rawhide/f/systemd.spec#_797). The fix should, hopefully, be fairly trivial: just extend the previous workaround a bit:

systemd-run --wait -t -p Wants=cloud-init.target -p After=cloud-init.target true && systemctl --user daemon-reexec

I'll give it a try (and hopefully it's the last issue for now).

I don't want to jinx it, but with:

systemd-run --wait -p Wants=cloud-init.target -p After=cloud-init.target true && systemctl --user daemon-reexec

all involved parties are finally content and the jobs don't fail anymore.
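
Spelled out, the two halves of the workaround do the following (the same commands as above, just annotated):

```
# 1) Block until cloud-init (and its background dnf update) has finished:
#    the transient unit is ordered After=cloud-init.target, so --wait only
#    returns once that target has been reached.
systemd-run --wait -p Wants=cloud-init.target -p After=cloud-init.target true
# 2) Re-execute the user manager, so it re-reads the user units shipped by
#    the freshly upgraded systemd instead of the stale pre-upgrade ones.
systemctl --user daemon-reexec
```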

The workaround is a bit ugly, since C8S is missing some upstream systemd patches that would allow it to be a single command, e.g.:

systemd-run --wait -p Wants=cloud-init.target -p After=cloud-init.target systemctl -M root@.host --user daemon-reexec

Another option would be:

systemd-run --wait -E DBUS_SESSION_BUS_ADDRESS=$DBUS_SESSION_BUS_ADDRESS -p Wants=cloud-init.target -p After=cloud-init.target systemctl --user daemon-reexec

but SELinux complains about this one, so it's a no-go again.

So, after moving all upstream jobs over to the metal machines, I noticed that Duffy provisions only one metal machine at a time and returns can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64' for all other API requests in the meantime. I know (after speaking with @arrfab) that only one machine is "pre-provisioned" in the metal pools, since the machines are expensive (which is totally okay), but if there are multiple requests for a metal machine (even from a single project), it can take a long time of going back and forth with Duffy before a machine is actually provisioned, given that only one is provisioned at a time.
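
In practice the client just has to keep polling, roughly like this (a sketch of the pattern using the duffy CLI, not the actual agent-control implementation, and assuming the CLI exits non-zero when the pool can't reserve a node - the retry count mirrors the 1440-try limit from the logs):

```
# Keep retrying until Duffy has a pre-provisioned metal node to hand out;
# with fill_level=1, concurrent requests serialize on the provisioning step.
for try in $(seq 1 1440); do
    duffy client request-session pool=metal-ec2-c5n-centos-8s-x86_64,quantity=1 && break
    sleep 5
done
```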

For example, after spawning ten jobs (the entire project quota ATM), all of them for metal machines, it took the last job ~30 minutes before Duffy got through all the other provisioning requests and accepted it:

00:01:21.556 2022-08-18 17:59:11,243 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-ec2-c5n-centos-8s-x86_64
00:02:24.819 2022-08-18 18:00:14,505 [agent-control/allocate_node] ERROR: [Try 12/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:03:42.296 2022-08-18 18:01:31,980 [agent-control/allocate_node] ERROR: [Try 24/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:04:53.520 2022-08-18 18:02:43,205 [agent-control/allocate_node] ERROR: [Try 36/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:06:07.031 2022-08-18 18:03:56,715 [agent-control/allocate_node] ERROR: [Try 48/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:07:21.300 2022-08-18 18:05:10,985 [agent-control/allocate_node] ERROR: [Try 60/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:08:30.040 2022-08-18 18:06:19,726 [agent-control/allocate_node] ERROR: [Try 72/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:09:39.719 2022-08-18 18:07:29,404 [agent-control/allocate_node] ERROR: [Try 84/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:10:52.893 2022-08-18 18:08:42,579 [agent-control/allocate_node] ERROR: [Try 96/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:12:10.151 2022-08-18 18:09:59,837 [agent-control/allocate_node] ERROR: [Try 108/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:13:26.832 2022-08-18 18:11:16,517 [agent-control/allocate_node] ERROR: [Try 120/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:14:43.728 2022-08-18 18:12:33,413 [agent-control/allocate_node] ERROR: [Try 132/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:15:59.154 2022-08-18 18:13:48,778 [agent-control/allocate_node] ERROR: [Try 144/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:17:10.876 2022-08-18 18:15:00,562 [agent-control/allocate_node] ERROR: [Try 156/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:18:17.324 2022-08-18 18:16:07,009 [agent-control/allocate_node] ERROR: [Try 168/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:19:23.845 2022-08-18 18:17:13,530 [agent-control/allocate_node] ERROR: [Try 180/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:20:34.989 2022-08-18 18:18:24,676 [agent-control/allocate_node] ERROR: [Try 192/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:21:42.599 2022-08-18 18:19:32,284 [agent-control/allocate_node] ERROR: [Try 204/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:22:49.741 2022-08-18 18:20:39,427 [agent-control/allocate_node] ERROR: [Try 216/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:23:56.153 2022-08-18 18:21:45,839 [agent-control/allocate_node] ERROR: [Try 228/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:25:05.723 2022-08-18 18:22:55,409 [agent-control/allocate_node] ERROR: [Try 240/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:26:16.080 2022-08-18 18:24:05,766 [agent-control/allocate_node] ERROR: [Try 252/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:27:25.813 2022-08-18 18:25:15,498 [agent-control/allocate_node] ERROR: [Try 264/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:28:36.028 2022-08-18 18:26:25,713 [agent-control/allocate_node] ERROR: [Try 276/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:29:44.501 2022-08-18 18:27:34,186 [agent-control/allocate_node] ERROR: [Try 288/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:30:51.414 2022-08-18 18:28:41,100 [agent-control/allocate_node] ERROR: [Try 300/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:31:48.194 2022-08-18 18:29:37,880 [agent-control/allocate_node] INFO: Allocated node n27-36-189.pool.ci.centos.org with session id 13530

Is this expected? I'm afraid the delay will get even worse once other projects migrate over to the metal machines, since currently it looks like only systemd is utilizing the pool (at least judging by the output of duffy client show-pool metal-ec2-c5n-centos-8s-x86_64).

On an unrelated note - I noticed another session stuck in "limbo": the machine is no longer accessible and the pool field is empty, but the session still exists:

(duffy) [systemd@slave01 ~]$ duffy client show-session 16583
{
  "action": "get",
  "session": {
    "retired_at": null,
    "active": true,
    "created_at": "2022-08-19T15:00:36.042719+00:00",
    "id": 16583,
    "expires_at": "2022-08-19T21:00:36.871262+00:00",
    "tenant": {
      "retired_at": null,
      "active": true,
      "created_at": "2022-06-23T13:25:27.880793+00:00",
      "name": "systemd",
      "is_admin": false,
      "ssh_key": "**********",
      "node_quota": null,
      "session_lifetime": null,
      "session_lifetime_max": null,
      "id": 5,
      "effective_node_quota": 10,
      "effective_session_lifetime": 21600.0,
      "effective_session_lifetime_max": 43200.0
    },
    "data": {
      "nodes_specs": [
        {
          "quantity": 1,
          "pool": "metal-ec2-c5n-centos-8s-x86_64"
        }
      ]
    },
    "nodes": [
      {
        "hostname": "n27-25-58.pool.ci.centos.org",
        "ipaddr": "172.27.25.58",
        "comment": null,
        "pool": null,
        "reusable": false,
        "data": {
          "provision": {
            "ec2_instance_id": "i-0b991b4609f71c7df",
            "ec2_instance_type": "c5n.metal",
            "hostname": "n27-25-58.pool.ci.centos.org",
            "ipaddr": "172.27.25.58",
            "public_hostname": "ec2-54-81-45-130.compute-1.amazonaws.com",
            "public_ipaddress": "54.81.45.130"
          },
          "nodes_spec": {
            "quantity": 1,
            "pool": "metal-ec2-c5n-centos-8s-x86_64"
          }
        },
        "id": 1448,
        "state": "done"
      }
    ]
  }
}

A quick update: two more machines were found stuck in limbo:

session_id=19618 active=TRUE created_at='2022-08-20 16:40:42.587697+00:00' retired_at= pool= hostname='n27-9-18.pool.ci.centos.org' ipaddr='172.27.9.18'
session_id=19658 active=TRUE created_at='2022-08-20 16:42:33.172434+00:00' retired_at= pool= hostname='n27-5-55.pool.ci.centos.org' ipaddr='172.27.5.55'

Also, Duffy has been stuck provisioning a metal machine (metal-ec2-c5n-centos-8s-x86_64) for two hours, and the API requests now time out quite frequently; not sure what's going on...

Had a quick look at duffy-tasks (the provisioner) and it's now tracebacking on something else:

Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]: kombu.exceptions.OperationalError: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands t>
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]: During handling of the above exception, another exception occurred:
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]: Traceback (most recent call last):
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/beat.py", line 287, in apply_entry
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     result = self.apply_async(entry, producer=producer, advance=False)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/beat.py", line 412, in apply_async
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     reraise(SchedulingError, SchedulingError(
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/exceptions.py", line 108, in reraise
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     raise value.with_traceback(tb)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/beat.py", line 404, in apply_async
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     return task.apply_async(entry_args, entry_kwargs,
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/app/task.py", line 575, in apply_async
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     return app.send_task(
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/app/base.py", line 788, in send_task
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     amqp.send_task_message(P, name, message, **options)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/app/amqp.py", line 510, in send_task_message
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     ret = producer.publish(
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/kombu/messaging.py", line 177, in publish
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     return _publish(
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/kombu/connection.py", line 556, in _ensured
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     errback and errback(exc, 0)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     self.gen.throw(type, value, traceback)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/kombu/connection.py", line 450, in _reraise_as_library_errors
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     raise ConnectionError(str(exc)) from exc
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]: celery.beat.SchedulingError: Couldn't apply scheduled task duffy.tasks.provision.fill_pools(): MISCONF Redis is configured to save RDB snapshots

@nphilipp ^

Also: I guess it would be better to report all Duffy issues on its dedicated tracker (https://github.com/CentOS/duffy/issues) - or should we just continue here?

Just a quick update: that one is again caused by this reported issue. I see it was closed upstream by @nphilipp, but it's not yet in a release, so not yet deployed on duffy.ci.centos.org.

Thanks a lot for today's quick fix!

Reporting on the dedicated tracker makes sense - I filed the machine-stuck-in-limbo issue as https://github.com/CentOS/duffy/issues/558.

I've just released 3.3.4 with the accrued fixes and updated Duffy on n23:

[nphilipp@n23 ~]$ duffy -c .config/duffy client show-pool virt-one-small-centos-9s-x86_64            
{
  "action": "get",
  "pool": {
    "name": "virt-one-small-centos-9s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 0,
      "ready": 3,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }                                                                                                
  }
}
[nphilipp@n23 ~]$ for i in 1 2; do duffy -c ~/.config/duffy/ client request-session pool=virt-one-small-centos-9s-x86_64,quantity=1 > duffy-$i.out & done; wait
[1] 3938193
[2] 3938194
[1]-  Done                    duffy -c ~/.config/duffy/ client request-session pool=virt-one-small-centos-9s-x86_64,quantity=1 > duffy-$i.out
[2]+  Done                    duffy -c ~/.config/duffy/ client request-session pool=virt-one-small-centos-9s-x86_64,quantity=1 > duffy-$i.out
[nphilipp@n23 ~]$ diff -u duffy-{1,2}.out
--- duffy-1.out 2022-08-22 08:30:08.529854474 +0000
+++ duffy-2.out 2022-08-22 08:30:07.485906704 +0000
@@ -3,9 +3,9 @@
   "session": {
     "retired_at": null,
     "active": true,
-    "created_at": "2022-08-22T08:30:07.870737+00:00",
-    "id": 62,
-    "expires_at": "2022-08-22T14:30:07.871686+00:00",
+    "created_at": "2022-08-22T08:30:06.249896+00:00",
+    "id": 60,
+    "expires_at": "2022-08-22T14:30:06.542881+00:00",
     "tenant": {
       "retired_at": null,
       "active": true,
@@ -31,24 +31,24 @@
     },
     "nodes": [
       {
-        "hostname": "n50-74.cloud.ci.centos.org",
-        "ipaddr": "10.234.50.74",
+        "hostname": "n50-73.cloud.ci.centos.org",
+        "ipaddr": "10.234.50.73",
         "comment": null,
         "pool": "virt-one-small-centos-9s-x86_64",
         "reusable": false,
         "data": {
           "provision": {
             "opennebula": {
-              "id": 169
+              "id": 168
             },
-            "ipaddr": "10.234.50.74"
+            "ipaddr": "10.234.50.73"
           },
           "nodes_spec": {
             "quantity": 1,
             "pool": "virt-one-small-centos-9s-x86_64"
           }
         },
-        "id": 383,
+        "id": 382,
         "state": "deployed"
       }
     ]
[nphilipp@n23 ~]$

@arrfab would you please install the update in prod? Also, users of the duffy client experiencing this issue should update to 3.3.4 as well, because previous versions might run into timeouts on the 2nd (3rd, ...) parallel request if the API retries the DB transaction: the httpx package used has a default request timeout of 5 seconds, which might be too short for this case; 3.3.4 sets the timeout to infinite.
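
For tenants, the client update should be just a pip upgrade inside the virtualenv (a sketch, assuming it was installed from PyPI with the client extra):

```
# Run inside the tenant's duffy virtualenv.
pip install --upgrade 'duffy[client]==3.3.4'
```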

Pushed in git/inventory and applied by ansible, so duffy.ci.centos.org is now running 3.3.4.
In parallel, the cico-workspace container was rebuilt on quay.io, so it will also be used for the next pods/jobs on the jenkins instances (reflecting that specific duffy[client] version as well).

@mrc0mmand: do you see any improvement in the last ~24h? We'd like to hear some feedback from you and (eventually) close this ticket if it's now resolved :)

Everything is looking good from my side - the API appears to be stable and I haven't seen any stuck sessions. I'll go ahead and close this ticket, since it's quite long already and all relevant issues have been either resolved or reported on the GH issue tracker.

Thanks a bunch to everyone involved!

Metadata Update from @mrc0mmand:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
