Today I noticed that some of the Duffy nodes are a bit unstable. Is this related to the "shuffling-around" caused by the pool migration? The issue occurs regardless of which Duffy API I use:
```
2022-08-01 09:28:48,790 [agent-control/allocate_node] INFO: Attempting to allocate a node (version: 8-stream, arch: x86_64, flavor: n/a)
2022-08-01 09:28:48,790 [agent-control/_execute_api_command] INFO: Duffy request URL: http://admin.ci.centos.org:8080/Node/get
2022-08-01 09:28:49,885 [agent-control/allocate_node] INFO: Successfully allocated node 'n49.pufty' (121)
2022-08-01 09:28:49,886 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-01 09:28:49,886 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n49.pufty': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-01 09:28:49,886 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n49.pufty dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n49.pufty,172.19.3.113' (ECDSA) to the list of known hosts.
27 files removed
CentOS Stream 8 - AppStream  51 MB/s | 24 MB  00:00
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-01 09:28:53,851 [agent-control/main] ERROR: Execution failed
```
```
2022-08-01 16:51:34,367 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-01 16:51:35,388 [agent-control/allocate_node] INFO: Allocated node n19.gusty.ci.centos.org with session id 259
2022-08-01 16:51:35,388 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-01 16:51:35,388 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n19.gusty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-01 16:51:35,388 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n19.gusty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n19.gusty.ci.centos.org,172.19.2.147' (ECDSA) to the list of known hosts.
0 files removed
CentOS Stream 8 - AppStream  62 MB/s | 24 MB  00:00
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-01 16:51:39,443 [agent-control/main] ERROR: Execution failed
```
```
2022-08-01 16:51:34,205 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-01 16:51:35,375 [agent-control/allocate_node] INFO: Allocated node n19.gusty.ci.centos.org with session id 258
2022-08-01 16:51:35,375 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-01 16:51:35,375 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n19.gusty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-01 16:51:35,375 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n19.gusty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n19.gusty.ci.centos.org,172.19.2.147' (ECDSA) to the list of known hosts.
6 files removed
Waiting for process with pid 6620 to finish.
CentOS Stream 8 - AppStream  62 MB/s | 24 MB  00:00
Timeout, server n19.gusty.ci.centos.org not responding.
```
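If the repodata cache on those nodes really is corrupted (which the `failed to open .../filelists.xml.gz` error suggests), one thing we could try from our side is wiping the dnf cache outright before `makecache`, instead of trusting `dnf clean all` to do it. A rough, untested sketch; the helper takes the cache directory as a parameter only so it can be exercised off-node, and the name `nuke_dnf_cache` is made up:

```shell
# Hypothetical mitigation: drop the (possibly corrupted) dnf repodata
# cache entirely so the next `dnf makecache` has to fetch everything fresh.
nuke_dnf_cache() {
    local cache_dir=${1:-/var/cache/dnf}
    rm -rf "$cache_dir"      # remove stale/corrupted repodata
    mkdir -p "$cache_dir"    # recreate the empty cache directory
}

# On an actual node this would be followed by:
#   dnf clean all && dnf makecache
```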
/cc @arrfab
Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, high-gain, investigation, medium-trouble
The gusty one is something I had an issue with yesterday as well, so I had to ask for a full chassis reset (again). But the other ones seem different: you have access, but it looks as if the metadata on them is corrupted? (strange) Are you seeing that behavior again today?
> gusty
Unfortunately, yes:
```
2022-08-02 08:30:35,339 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 08:30:36,451 [agent-control/allocate_node] INFO: Allocated node n28.gusty.ci.centos.org with session id 403
2022-08-02 08:30:36,452 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 08:30:36,452 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n28.gusty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 08:30:36,452 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n28.gusty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n28.gusty.ci.centos.org,172.19.2.156' (ECDSA) to the list of known hosts.
Waiting for process with pid 11712 to finish.
6 files removed
CentOS Stream 8 - AppStream  59 MB/s | 24 MB  00:00
Timeout, server n28.gusty.ci.centos.org not responding.
2022-08-02 08:30:50,980 [agent-control/main] ERROR: Execution failed
```
```
2022-08-02 08:30:35,265 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 08:30:36,437 [agent-control/allocate_node] INFO: Allocated node n28.gusty.ci.centos.org with session id 402
2022-08-02 08:30:36,437 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 08:30:36,437 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n28.gusty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 08:30:36,437 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n28.gusty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n28.gusty.ci.centos.org,172.19.2.156' (ECDSA) to the list of known hosts.
27 files removed
CentOS Stream 8 - AppStream  60 MB/s | 24 MB  00:00
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-02 08:30:40,375 [agent-control/main] ERROR: Execution failed
```
This has also been happening from time to time since yesterday:
```
2022-08-02 08:26:31,035 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 08:26:32,193 [agent-control/allocate_node] INFO: Allocated node n23.pufty.ci.centos.org with session id 396
2022-08-02 08:26:32,193 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 08:26:32,193 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n23.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 08:26:32,193 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n23.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n23.pufty.ci.centos.org,172.19.3.87' (ECDSA) to the list of known hosts.
FATAL: command execution failed
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:221)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:784)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:172)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
    at hudson.remoting.Channel.close(Channel.java:1493)
    at hudson.remoting.Channel.close(Channel.java:1446)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:872)
    at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:113)
    at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:763)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from 10.129.2.54/10.129.2.54:35034' is disconnected.
    at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
    at com.sun.proxy.$Proxy113.isAlive(Unknown Source)
    at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1147)
    at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1139)
    at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:741)
    at hudson.model.Build$BuildExecution.build(Build.java:206)
    at hudson.model.Build$BuildExecution.doRun(Build.java:163)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
    at hudson.model.Run.execute(Run.java:1880)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:428)
FATAL: Unable to delete script file /tmp/jenkins5743870364005224433.sh
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:221)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:784)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:172)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
    at hudson.remoting.Channel.close(Channel.java:1493)
    at hudson.remoting.Channel.close(Channel.java:1446)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:872)
    at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:113)
    at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:763)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@49b0491b:JNLP4-connect connection from 10.129.2.54/10.129.2.54:35034": Remote call on JNLP4-connect connection from 10.129.2.54/10.129.2.54:35034 failed. The channel is closing down or has closed down
    at hudson.remoting.Channel.call(Channel.java:991)
    at hudson.FilePath.act(FilePath.java:1069)
    at hudson.FilePath.act(FilePath.java:1058)
    at hudson.FilePath.delete(FilePath.java:1543)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:123)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:741)
    at hudson.model.Build$BuildExecution.build(Build.java:206)
    at hudson.model.Build$BuildExecution.doRun(Build.java:163)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
    at hudson.model.Run.execute(Run.java:1880)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:428)
Build step 'Execute shell' marked build as failure
```
Not sure if it's related.
And it doesn't seem to be gusty-specific:
```
2022-08-02 11:12:12,626 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 11:12:14,003 [agent-control/allocate_node] INFO: Allocated node n53.pufty.ci.centos.org with session id 434
2022-08-02 11:12:14,003 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 11:12:14,003 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n53.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 11:12:14,003 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n53.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n53.pufty.ci.centos.org,172.19.3.117' (ECDSA) to the list of known hosts.
0 files removed
CentOS Stream 8 - AppStream  54 MB/s | 24 MB  00:00
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-02 11:12:18,075 [agent-control/main] ERROR: Execution failed
```

```
2022-08-02 11:12:12,885 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 11:12:14,593 [agent-control/allocate_node] INFO: Allocated node n10.pufty.ci.centos.org with session id 436
2022-08-02 11:12:14,593 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 11:12:14,593 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n10.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 11:12:14,593 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n10.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n10.pufty.ci.centos.org,172.19.3.74' (ECDSA) to the list of known hosts.
0 files removed
CentOS Stream 8 - AppStream  53 MB/s | 24 MB  00:00
failed to open: /var/cache/dnf/appstream-773ef6463612e8e2/repodata/2f5f47034f44cea88724f5ed0beaa3027732a937966ad51a180a4cd8986b89bb-filelists.xml.gz
2022-08-02 11:12:18,704 [agent-control/main] ERROR: Execution failed
```

```
2022-08-02 11:12:12,528 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 11:12:13,985 [agent-control/allocate_node] INFO: Allocated node n53.pufty.ci.centos.org with session id 433
2022-08-02 11:12:13,986 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 11:12:13,986 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n53.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 11:12:13,986 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n53.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n53.pufty.ci.centos.org,172.19.3.117' (ECDSA) to the list of known hosts.
0 files removed
Waiting for process with pid 6896 to finish.
CentOS Stream 8 - AppStream  51 MB/s | 24 MB  00:00
Timeout, server n53.pufty.ci.centos.org not responding.
2022-08-02 11:12:32,723 [agent-control/main] ERROR: Execution failed
```

```
2022-08-02 11:12:12,473 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
2022-08-02 11:12:13,961 [agent-control/allocate_node] INFO: Allocated node n53.pufty.ci.centos.org with session id 432
2022-08-02 11:12:13,961 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2022-08-02 11:12:13,961 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n53.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2022-08-02 11:12:13,961 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n53.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'n53.pufty.ci.centos.org,172.19.3.117' (ECDSA) to the list of known hosts.
6 files removed
Waiting for process with pid 6896 to finish.
Waiting for process with pid 6903 to finish.
Timeout, server n53.pufty.ci.centos.org not responding.
```
Maybe the machine is still being provisioned at the time we get access to it?
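If so, a crude guard on our side might be to poll the node until it settles before starting PHASE 1, instead of failing on the first lock or timeout (the "Waiting for process with pid ... to finish" lines suggest something else is still holding the dnf lock). A minimal sketch with arbitrary retry budgets; using `systemctl is-system-running` as the readiness probe is just an assumption about what "done provisioning" looks like:

```shell
# Generic retry helper: run a probe until it succeeds or the attempt
# budget runs out. Hypothetical use against a freshly allocated node:
#   retry_until 18 10 ssh root@"$node" systemctl is-system-running
retry_until() {
    local retries=$1 delay=$2 i
    shift 2
    for ((i = 0; i < retries; i++)); do
        "$@" && return 0   # probe succeeded: node looks ready
        sleep "$delay"
    done
    return 1               # node never settled within the budget
}
```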
Oh well, the pufty chassis is giving up:
```
00:18:43.942 archlinux_systemd_ci: [1531/1710] Generating man/systemd.resource-control.html with a custom command
00:18:44.157 archlinux_systemd_ci: [1532/1710] Generating man/systemd.slice.html with a custom command
00:18:44.179 archlinux_systemd_ci: [1533/1710] Generating man/systemd.network.html with a custom command
00:18:44.288 archlinux_systemd_ci: [1534/1710] Generating man/systemd.scope.5 with a custom command
00:18:44.340 archlinux_systemd_ci: [1535/1710] Generating man/systemd.service.html with a custom command
00:18:44.368 archlinux_systemd_ci: [1536/1710] Generating man/systemd.resource-control.5 with a custom command
00:18:44.379 archlinux_systemd_ci: [1537/1710] Generating man/systemd.service.5 with a custom command
00:18:44.431 archlinux_systemd_ci: [1538/1710] Generating man/systemd.slice.5 with a custom command
00:18:52.439 Timeout, server n26.pufty.ci.centos.org not responding.
00:18:52.440 2022-08-02 11:33:18,255 [agent-control/fetch_artifacts] INFO: Fetching artifacts from node n26.pufty.ci.centos.org: (remote: ~/vagrant-logs*, local: ./artifacts_8i1r4m69)
00:18:52.440 2022-08-02 11:33:18,255 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/scp -r -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@n26.pufty.ci.centos.org:~/vagrant-logs* ./artifacts_8i1r4m69
00:19:02.633 ssh: connect to host n26.pufty.ci.centos.org port 22: No route to host
```
```
00:00:59.743 2022-08-02 11:59:19,811 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-seamicro-large-centos-8s-x86_64
00:01:00.751 2022-08-02 11:59:20,819 [agent-control/allocate_node] INFO: Allocated node n45.pufty.ci.centos.org with session id 464
00:01:00.751 2022-08-02 11:59:20,819 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
00:01:00.752 2022-08-02 11:59:20,819 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n45.pufty.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
00:01:00.752 2022-08-02 11:59:20,819 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n45.pufty.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
00:01:00.757 Pseudo-terminal will not be allocated because stdin is not a terminal.
00:01:00.789 Warning: Permanently added 'n45.pufty.ci.centos.org,172.19.3.109' (ECDSA) to the list of known hosts.
00:01:01.366 0 files removed
00:01:01.856 Waiting for process with pid 6579 to finish.
00:01:27.892 CentOS Stream 8 - AppStream  949 kB/s | 4.4 kB  00:00
00:01:28.139 CentOS Stream 8 - BaseOS  778 kB/s | 3.9 kB  00:00
00:01:28.408 CentOS Stream 8 - Extras  659 kB/s | 2.9 kB  00:00
00:01:28.414 CentOS Stream 8 - Extras common packages  682 kB/s | 3.0 kB  00:00
00:01:32.343 Metadata cache created.
00:01:32.881 Waiting for process with pid 6611 to finish.
00:01:34.590 Last metadata expiration check: 0:00:06 ago on Tue 02 Aug 2022 12:59:48 PM BST.
00:01:36.888 Package bash-4.4.20-4.el8.x86_64 is already installed.
00:01:36.891 Package rsync-3.1.3-14.el8.2.x86_64 is already installed.
00:01:37.057 Dependencies resolved.
00:01:37.076 =========================================================================================
00:01:37.076 Package                  Arch    Version                             Repo       Size
00:01:37.076 =========================================================================================
00:01:37.076 Installing:
00:01:37.076 git                      x86_64  2.31.1-2.el8                        appstream  161 k
00:01:37.076 Installing dependencies:
00:01:37.076 git-core                 x86_64  2.31.1-2.el8                        appstream  4.7 M
00:01:37.076 git-core-doc             noarch  2.31.1-2.el8                        appstream  2.6 M
00:01:37.076 perl-Carp                noarch  1.42-396.el8                        baseos     30 k
00:01:37.076 perl-Data-Dumper         x86_64  2.167-399.el8                       baseos     58 k
00:01:37.076 perl-Digest              noarch  1.17-395.el8                        appstream  27 k
00:01:37.076 perl-Digest-MD5          x86_64  2.55-396.el8                        appstream  37 k
00:01:37.076 perl-Encode              x86_64  4:2.97-3.el8                        baseos     1.5 M
00:01:37.076 perl-Errno               x86_64  1.28-421.el8                        baseos     76 k
00:01:37.076 perl-Error               noarch  1:0.17025-2.el8                     appstream  46 k
00:01:37.076 perl-Exporter            noarch  5.72-396.el8                        baseos     34 k
00:01:37.076 perl-File-Path           noarch  2.15-2.el8                          baseos     38 k
00:01:37.076 perl-File-Temp           noarch  0.230.600-1.el8                     baseos     63 k
00:01:37.076 perl-Getopt-Long         noarch  1:2.50-4.el8                        baseos     63 k
00:01:37.076 perl-Git                 noarch  2.31.1-2.el8                        appstream  78 k
00:01:37.076 perl-HTTP-Tiny           noarch  0.074-1.el8                         baseos     58 k
00:01:37.076 perl-IO                  x86_64  1.38-421.el8                        baseos     142 k
00:01:37.076 perl-MIME-Base64         x86_64  3.15-396.el8                        baseos     31 k
00:01:37.076 perl-Net-SSLeay          x86_64  1.88-1.module_el8.4.0+517+be1595ff  appstream  379 k
00:01:37.076 perl-PathTools           x86_64  3.74-1.el8                          baseos     90 k
00:01:37.076 perl-Pod-Escapes         noarch  1:1.07-395.el8                      baseos     20 k
00:01:37.076 perl-Pod-Perldoc         noarch  3.28-396.el8                        baseos     86 k
00:01:37.076 perl-Pod-Simple          noarch  1:3.35-395.el8                      baseos     213 k
00:01:37.076 perl-Pod-Usage           noarch  4:1.69-395.el8                      baseos     34 k
00:01:37.076 perl-Scalar-List-Utils   x86_64  3:1.49-2.el8                        baseos     68 k
00:01:37.076 perl-Socket              x86_64  4:2.027-3.el8                       baseos     59 k
00:01:37.076 perl-Storable            x86_64  1:3.11-3.el8                        baseos     98 k
00:01:37.076 perl-Term-ANSIColor      noarch  4.06-396.el8                        baseos     46 k
00:01:37.076 perl-Term-Cap            noarch  1.17-395.el8                        baseos     23 k
00:01:37.076 perl-TermReadKey         x86_64  2.37-7.el8                          appstream  40 k
00:01:37.076 perl-Text-ParseWords     noarch  3.30-395.el8                        baseos     18 k
00:01:37.076 perl-Text-Tabs+Wrap      noarch  2013.0523-395.el8                   baseos     24 k
00:01:37.076 perl-Time-Local          noarch  1:1.280-1.el8                       baseos     34 k
00:01:37.076 perl-URI                 noarch  1.73-3.el8                          appstream  116 k
00:01:37.076 perl-Unicode-Normalize   x86_64  1.25-396.el8                        baseos     82 k
00:01:37.076 perl-constant            noarch  1.33-396.el8                        baseos     25 k
00:01:37.076 perl-interpreter         x86_64  4:5.26.3-421.el8                    baseos     6.3 M
00:01:37.076 perl-libnet              noarch  3.11-3.el8                          appstream  121 k
00:01:37.076 perl-libs                x86_64  4:5.26.3-421.el8                    baseos     1.6 M
00:01:37.076 perl-macros              x86_64  4:5.26.3-421.el8                    baseos     72 k
00:01:37.076 perl-parent              noarch  1:0.237-1.el8                       baseos     20 k
00:01:37.076 perl-podlators           noarch  4.11-1.el8                          baseos     118 k
00:01:37.076 perl-threads             x86_64  1:2.21-2.el8                        baseos     61 k
00:01:37.076 perl-threads-shared      x86_64  1.58-2.el8                          baseos     48 k
00:01:37.076 Installing weak dependencies:
00:01:37.076 perl-IO-Socket-IP        noarch  0.39-5.el8                          appstream  47 k
00:01:37.076 perl-IO-Socket-SSL       noarch  2.066-4.module_el8.4.0+517+be1595ff appstream  298 k
00:01:37.076 perl-Mozilla-CA          noarch  20160104-7.module_el8.3.0+416+dee7bcef appstream 15 k
00:01:37.076 Enabling module streams:
00:01:37.076 perl                     5.26
00:01:37.076 perl-IO-Socket-SSL       2.066
00:01:37.076 perl-libwww-perl         6.34
00:01:37.076
00:01:37.076 Transaction Summary
00:01:37.076 =========================================================================================
00:01:37.076 Install  47 Packages
00:01:37.076
00:01:37.082 Total size: 20 M
00:01:37.082 Installed size: 73 M
00:01:37.084 Downloading Packages:
00:01:37.094 [SKIPPED] git-2.31.1-2.el8.x86_64.rpm: Already downloaded
00:01:37.094 [SKIPPED] git-core-2.31.1-2.el8.x86_64.rpm: Already downloaded
00:01:37.094 [SKIPPED] git-core-doc-2.31.1-2.el8.noarch.rpm: Already downloaded
00:01:37.094 [SKIPPED] perl-Digest-1.17-395.el8.noarch.rpm: Already downloaded
00:01:37.094 [SKIPPED] perl-Digest-MD5-2.55-396.el8.x86_64.rpm: Already downloaded
00:01:37.094 [SKIPPED] perl-Error-0.17025-2.el8.noarch.rpm: Already downloaded
00:01:37.094 [SKIPPED] perl-Git-2.31.1-2.el8.noarch.rpm: Already downloaded
00:01:37.094 [SKIPPED] perl-IO-Socket-IP-0.39-5.el8.noarch.rpm: Already downloaded
00:01:37.094 [SKIPPED] perl-IO-Socket-SSL-2.066-4.module_el8.4.0+517+be1595ff.noarch.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-Mozilla-CA-20160104-7.module_el8.3.0+416+dee7bcef.noarch.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-Net-SSLeay-1.88-1.module_el8.4.0+517+be1595ff.x86_64.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-TermReadKey-2.37-7.el8.x86_64.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-URI-1.73-3.el8.noarch.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-libnet-3.11-3.el8.noarch.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-Carp-1.42-396.el8.noarch.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-Data-Dumper-2.167-399.el8.x86_64.rpm: Already downloaded
00:01:37.095 [SKIPPED] perl-Encode-2.97-3.el8.x86_64.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-Errno-1.28-421.el8.x86_64.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-Exporter-5.72-396.el8.noarch.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-File-Path-2.15-2.el8.noarch.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-File-Temp-0.230.600-1.el8.noarch.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-Getopt-Long-2.50-4.el8.noarch.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-HTTP-Tiny-0.074-1.el8.noarch.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-IO-1.38-421.el8.x86_64.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-MIME-Base64-3.15-396.el8.x86_64.rpm: Already downloaded
00:01:37.096 [SKIPPED] perl-PathTools-3.74-1.el8.x86_64.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Pod-Escapes-1.07-395.el8.noarch.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Pod-Perldoc-3.28-396.el8.noarch.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Pod-Simple-3.35-395.el8.noarch.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Pod-Usage-1.69-395.el8.noarch.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Scalar-List-Utils-1.49-2.el8.x86_64.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Socket-2.027-3.el8.x86_64.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Storable-3.11-3.el8.x86_64.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Term-ANSIColor-4.06-396.el8.noarch.rpm: Already downloaded
00:01:37.097 [SKIPPED] perl-Term-Cap-1.17-395.el8.noarch.rpm: Already downloaded
00:01:37.098 [SKIPPED] perl-Text-ParseWords-3.30-395.el8.noarch.rpm: Already downloaded
00:01:37.098 [SKIPPED] perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch.rpm: Already downloaded
00:01:37.098 [SKIPPED] perl-Time-Local-1.280-1.el8.noarch.rpm: Already downloaded
00:01:37.098 [SKIPPED] perl-Unicode-Normalize-1.25-396.el8.x86_64.rpm: Already downloaded
00:01:37.098 [SKIPPED] perl-constant-1.33-396.el8.noarch.rpm: Already downloaded
00:01:37.099 [SKIPPED] perl-interpreter-5.26.3-421.el8.x86_64.rpm: Already downloaded
00:01:37.099 [SKIPPED] perl-libs-5.26.3-421.el8.x86_64.rpm: Already downloaded
00:01:37.099 [SKIPPED] perl-macros-5.26.3-421.el8.x86_64.rpm: Already downloaded
00:01:37.099 [SKIPPED] perl-parent-0.237-1.el8.noarch.rpm: Already downloaded
00:01:37.099 [SKIPPED] perl-podlators-4.11-1.el8.noarch.rpm: Already downloaded
00:01:37.099 [SKIPPED] perl-threads-2.21-2.el8.x86_64.rpm: Already downloaded
00:01:37.099 [SKIPPED] perl-threads-shared-1.58-2.el8.x86_64.rpm: Already downloaded
00:01:38.168 Running transaction check
00:01:38.168 Waiting for process with pid 6611 to finish.
00:01:48.251 Transaction check succeeded.
00:01:48.251 Running transaction test
00:01:48.521 FATAL ERROR: python callback ??? failed, aborting!
00:01:50.132 2022-08-02 12:00:09,202 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n45.pufty.ci.centos.org systemd-centos-ci/vagrant/vagrant-ci-wrapper.sh -d 'arch-sanitizers-clang' -r 'pr:24178'
00:01:50.138 Pseudo-terminal will not be allocated because stdin is not a terminal.
00:01:50.174 Warning: Permanently added 'n45.pufty.ci.centos.org,172.19.3.109' (ECDSA) to the list of known hosts.
00:01:50.579 [TASK-CONTROL] OPTIMAL_QEMU_SMP = 4
00:01:50.579 [TASK-CONTROL] MAX_QUEUE_SIZE = 2
00:01:50.586 Cloning repo: https://github.com/systemd/systemd.git
00:01:50.589 Cloning into 'systemd'...
00:02:12.612 Timeout, server n45.pufty.ci.centos.org not responding.
00:02:12.613 2022-08-02 12:00:31,683 [agent-control/fetch_artifacts] INFO: Fetching artifacts from node n45.pufty.ci.centos.org: (remote: ~/vagrant-logs*, local: ./artifacts_3vttpv74)
00:02:12.613 2022-08-02 12:00:31,683 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/scp -r -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@n45.pufty.ci.centos.org:~/vagrant-logs* ./artifacts_3vttpv74
00:03:19.942 ssh: connect to host n45.pufty.ci.centos.org port 22: No route to host
```
nope, it's more that the node is somehow still listed as available for another tenant; I'm trying to get @nphilipp's help to understand what's going on
And just to confirm - this is not exclusive to the seamicro chassis, it happens for EC2/metal machines as well:
2022-08-03 11:07:15,273 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-ec2-c5n-centos-8s-x86_64 2022-08-03 11:07:16,217 [agent-control/allocate_node] INFO: Allocated node n27-31-90.pool.ci.centos.org with session id 817 2022-08-03 11:07:16,217 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository 2022-08-03 11:07:16,218 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-31-90.pool.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci 2022-08-03 11:07:16,218 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n27-31-90.pool.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci Pseudo-terminal will not be allocated because stdin is not a terminal. Warning: Permanently added 'n27-31-90.pool.ci.centos.org,172.27.31.90' (ECDSA) to the list of known hosts. 27 files removed CentOS Stream 8 - AppStream 27 MB/s | 24 MB 00:00 CentOS Stream 8 - BaseOS 92 MB/s | 25 MB 00:00 CentOS Stream 8 - Extras 89 kB/s | 18 kB 00:00 CentOS Stream 8 - Extras common packages 9.1 kB/s | 4.6 kB 00:00 Metadata cache created. Last metadata expiration check: 0:00:02 ago on Wed 03 Aug 2022 11:07:27 AM UTC. Package bash-4.4.20-4.el8.x86_64 is already installed. Package rsync-3.1.3-14.el8.2.x86_64 is already installed. Dependencies resolved. 
========================================================================================= Package Arch Version Repo Size ========================================================================================= Installing: git x86_64 2.31.1-2.el8 appstream 161 k Installing dependencies: emacs-filesystem noarch 1:26.1-7.el8 baseos 70 k git-core x86_64 2.31.1-2.el8 appstream 4.7 M git-core-doc noarch 2.31.1-2.el8 appstream 2.6 M perl-Carp noarch 1.42-396.el8 baseos 30 k perl-Data-Dumper x86_64 2.167-399.el8 baseos 58 k perl-Digest noarch 1.17-395.el8 appstream 27 k perl-Digest-MD5 x86_64 2.55-396.el8 appstream 37 k perl-Encode x86_64 4:2.97-3.el8 baseos 1.5 M perl-Errno x86_64 1.28-421.el8 baseos 76 k perl-Error noarch 1:0.17025-2.el8 appstream 46 k perl-Exporter noarch 5.72-396.el8 baseos 34 k perl-File-Path noarch 2.15-2.el8 baseos 38 k perl-File-Temp noarch 0.230.600-1.el8 baseos 63 k perl-Getopt-Long noarch 1:2.50-4.el8 baseos 63 k perl-Git noarch 2.31.1-2.el8 appstream 78 k perl-HTTP-Tiny noarch 0.074-1.el8 baseos 58 k perl-IO x86_64 1.38-421.el8 baseos 142 k perl-MIME-Base64 x86_64 3.15-396.el8 baseos 31 k perl-Net-SSLeay x86_64 1.88-1.module_el8.4.0+517+be1595ff appstream 379 k perl-PathTools x86_64 3.74-1.el8 baseos 90 k perl-Pod-Escapes noarch 1:1.07-395.el8 baseos 20 k perl-Pod-Perldoc noarch 3.28-396.el8 baseos 86 k perl-Pod-Simple noarch 1:3.35-395.el8 baseos 213 k perl-Pod-Usage noarch 4:1.69-395.el8 baseos 34 k perl-Scalar-List-Utils x86_64 3:1.49-2.el8 baseos 68 k perl-Socket x86_64 4:2.027-3.el8 baseos 59 k perl-Storable x86_64 1:3.11-3.el8 baseos 98 k perl-Term-ANSIColor noarch 4.06-396.el8 baseos 46 k perl-Term-Cap noarch 1.17-395.el8 baseos 23 k perl-TermReadKey x86_64 2.37-7.el8 appstream 40 k perl-Text-ParseWords noarch 3.30-395.el8 baseos 18 k perl-Text-Tabs+Wrap noarch 2013.0523-395.el8 baseos 24 k perl-Time-Local noarch 1:1.280-1.el8 baseos 34 k perl-URI noarch 1.73-3.el8 appstream 116 k perl-Unicode-Normalize x86_64 1.25-396.el8 baseos 82 k 
perl-constant noarch 1.33-396.el8 baseos 25 k perl-interpreter x86_64 4:5.26.3-421.el8 baseos 6.3 M perl-libnet noarch 3.11-3.el8 appstream 121 k perl-libs x86_64 4:5.26.3-421.el8 baseos 1.6 M perl-macros x86_64 4:5.26.3-421.el8 baseos 72 k perl-parent noarch 1:0.237-1.el8 baseos 20 k perl-podlators noarch 4.11-1.el8 baseos 118 k perl-threads x86_64 1:2.21-2.el8 baseos 61 k perl-threads-shared x86_64 1.58-2.el8 baseos 48 k Installing weak dependencies: perl-IO-Socket-IP noarch 0.39-5.el8 appstream 47 k perl-IO-Socket-SSL noarch 2.066-4.module_el8.4.0+517+be1595ff appstream 298 k perl-Mozilla-CA noarch 20160104-7.module_el8.3.0+416+dee7bcef appstream 15 k Enabling module streams: perl 5.26 perl-IO-Socket-SSL 2.066 perl-libwww-perl 6.34 Transaction Summary ========================================================================================= Install 48 Packages Total download size: 20 M Installed size: 73 M Downloading Packages: (1/48): git-2.31.1-2.el8.x86_64.rpm 6.7 MB/s | 161 kB 00:00 (2/48): perl-Digest-1.17-395.el8.noarch.rpm 7.1 MB/s | 27 kB 00:00 (3/48): perl-Digest-MD5-2.55-396.el8.x86_64.rpm 7.0 MB/s | 37 kB 00:00 (4/48): perl-Error-0.17025-2.el8.noarch.rpm 13 MB/s | 46 kB 00:00 (5/48): perl-Git-2.31.1-2.el8.noarch.rpm 20 MB/s | 78 kB 00:00 (6/48): perl-IO-Socket-IP-0.39-5.el8.noarch.rpm 9.4 MB/s | 47 kB 00:00 (7/48): git-core-doc-2.31.1-2.el8.noarch.rpm 11 MB/s | 2.6 MB 00:00 (8/48): perl-IO-Socket-SSL-2.066-4.module_el8.4 901 kB/s | 298 kB 00:00 (9/48): perl-Mozilla-CA-20160104-7.module_el8.3 118 kB/s | 15 kB 00:00 (10/48): perl-TermReadKey-2.37-7.el8.x86_64.rpm 9.2 MB/s | 40 kB 00:00 (11/48): perl-Net-SSLeay-1.88-1.module_el8.4.0+ 5.5 MB/s | 379 kB 00:00 (12/48): perl-URI-1.73-3.el8.noarch.rpm 1.7 MB/s | 116 kB 00:00 (13/48): perl-libnet-3.11-3.el8.noarch.rpm 1.6 MB/s | 121 kB 00:00 (14/48): emacs-filesystem-26.1-7.el8.noarch.rpm 973 kB/s | 70 kB 00:00 Timeout, server n27-31-90.pool.ci.centos.org not responding.
Metadata Update from @zlopez: - Issue priority set to: 🔥 Urgent 🔥 (was: Needs Review)
Metadata Update from @arrfab: - Issue assigned to nphilipp
Quick update before the weekend: To sort out the different requests stepping on each other's toes, I'm working on changing Duffy to retry the failing query a couple of times (before giving up), so allocating nodes gets effectively serialized for them.
Thanks for the update! I've done the same on my side this morning (i.e. retry the request if duffy.client raises httpx.HTTPStatusError) and so far it seems to be working pretty well (no more unexpected job failures). So this appears to be the last issue, at least from the user's point of view, regarding node/API stability.
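For reference, a minimal sketch of such a retry loop as a generic shell helper (my own generic helper, not part of Duffy; the `duffy` invocation in the usage comment is illustrative, not exact syntax):

```shell
# Generic retry helper: run a command up to $1 times, sleeping $2 seconds
# between attempts; succeeds as soon as the command does.
retry() {
    local tries=$1 delay=$2 i
    shift 2
    for (( i = 1; i <= tries; i++ )); do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$tries failed" >&2
        if (( i < tries )); then
            sleep "$delay"
        fi
    done
    return 1
}

# Illustrative usage (check 'duffy client --help' for the exact syntax):
# retry 10 60 duffy client request-session ...
```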
👍 This is the workaround I'd suggest to anyone running into this problem until we have a fix deployed.
Yesterday I moved 2/3 of our jobs over to the EC2 metal machines, and since then I've been seeing frequent intermittent issues that cause the jobs to fail randomly:
2022-08-09 22:43:33,992 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-ec2-c5n-centos-8s-x86_64 2022-08-09 22:44:17,880 [agent-control/allocate_node] INFO: Allocated node n27-28-26.pool.ci.centos.org with session id 3406 2022-08-09 22:44:17,880 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository 2022-08-09 22:44:17,880 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-28-26.pool.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci 2022-08-09 22:44:17,880 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n27-28-26.pool.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci Pseudo-terminal will not be allocated because stdin is not a terminal. Warning: Permanently added 'n27-28-26.pool.ci.centos.org,172.27.28.26' (ECDSA) to the list of known hosts. 0 files removed Waiting for process with pid 2763 to finish. CentOS Stream 8 - AppStream 98 kB/s | 4.4 kB 00:00 CentOS Stream 8 - BaseOS 21 kB/s | 3.9 kB 00:00 CentOS Stream 8 - Extras 38 kB/s | 2.9 kB 00:00 Metadata cache created. Last metadata expiration check: 0:00:02 ago on Tue 09 Aug 2022 10:44:33 PM UTC. Package bash-4.4.20-1.el8_4.x86_64 is already installed. Package rsync-3.1.3-12.el8.x86_64 is already installed. Dependencies resolved. 
========================================================================================= Package Arch Version Repo Size ========================================================================================= Installing: git x86_64 2.31.1-2.el8 appstream 161 k Upgrading: bash x86_64 4.4.20-4.el8 baseos 1.5 M rsync x86_64 3.1.3-16.el8 baseos 406 k <...snip...> (47/50): perl-threads-2.21-2.el8.x86_64.rpm 8.3 MB/s | 61 kB 00:00 (48/50): perl-threads-shared-1.58-2.el8.x86_64. 5.6 MB/s | 48 kB 00:00 (49/50): perl-libs-5.26.3-421.el8.x86_64.rpm 22 MB/s | 1.6 MB 00:00 (50/50): perl-interpreter-5.26.3-421.el8.x86_64 22 MB/s | 6.3 MB 00:00 -------------------------------------------------------------------------------- Total 22 MB/s | 20 MB 00:00 Running transaction check Waiting for process with pid 2763 to finish. The downloaded packages were saved in cache until the next successful transaction. You can remove cached packages by executing 'dnf clean packages'. Traceback (most recent call last): File "/usr/bin/dnf", line 58, in <module> main.user_main(sys.argv[1:], exit_code=True) File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 201, in user_main errcode = main(args) File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 67, in main return _main(base, args, cli_class, option_parser_class) File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 106, in _main return cli_run(cli, base) File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 130, in cli_run ret = resolving(cli, base) File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 176, in resolving base.do_transaction(display=displays) File "/usr/lib/python3.6/site-packages/dnf/cli/cli.py", line 245, in do_transaction display = [output.CliTransactionDisplay()] + list(display) File "/usr/lib/python3.6/site-packages/dnf/base.py", line 891, in do_transaction timer() File "/usr/lib/python3.6/site-packages/dnf/db/group.py", line 337, in _populate_rpm_ts ts.addErase(tsi.pkg.idx) File 
"/usr/lib64/python3.6/site-packages/rpm/transaction.py", line 109, in addErase raise rpm.error("package not installed") _rpm.error: package not installed 2022-08-09 22:49:17,836 [agent-control/main] ERROR: Execution failed
2022-08-09 23:04:08,594 [agent-control/allocate_node] ERROR: [Try 168/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'" 2022-08-09 23:05:12,187 [agent-control/allocate_node] ERROR: [Try 180/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'" 2022-08-09 23:06:15,838 [agent-control/allocate_node] ERROR: [Try 192/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'" 2022-08-09 23:07:20,665 [agent-control/allocate_node] INFO: Allocated node n27-27-1.pool.ci.centos.org with session id 3681 2022-08-09 23:07:20,666 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository 2022-08-09 23:07:20,666 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-27-1.pool.ci.centos.org': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci 2022-08-09 23:07:20,666 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n27-27-1.pool.ci.centos.org dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci Pseudo-terminal will not be allocated because stdin is not a terminal. Warning: Permanently added 'n27-27-1.pool.ci.centos.org,172.27.27.1' (ECDSA) to the list of known hosts. 0 files removed Waiting for process with pid 2625 to finish. CentOS Stream 8 - AppStream 45 kB/s | 4.4 kB 00:00 CentOS Stream 8 - BaseOS 15 kB/s | 3.9 kB 00:00 CentOS Stream 8 - Extras 16 kB/s | 2.9 kB 00:00 Metadata cache created. 
Last metadata expiration check: 0:00:02 ago on Tue 09 Aug 2022 11:07:38 PM UTC. Package bash-4.4.20-1.el8_4.x86_64 is already installed. Package rsync-3.1.3-12.el8.x86_64 is already installed. Dependencies resolved. ========================================================================================= Package Arch Version Repo Size ========================================================================================= Installing: git x86_64 2.31.1-2.el8 appstream 161 k Upgrading: bash x86_64 4.4.20-4.el8 baseos 1.5 M <...snip...> (49/50): perl-threads-shared-1.58-2.el8.x86_64. 4.1 MB/s | 48 kB 00:00 (50/50): perl-interpreter-5.26.3-421.el8.x86_64 50 MB/s | 6.3 MB 00:00 -------------------------------------------------------------------------------- Total 13 MB/s | 20 MB 00:01 Running transaction check Waiting for process with pid 2625 to finish. The downloaded packages were saved in cache until the next successful transaction. You can remove cached packages by executing 'dnf clean packages'. [Errno 2] No such file or directory: '/var/cache/dnf/baseos-5d5072979dfa6055/packages/bash-4.4.20-4.el8.x86_64.rpm'
The dnf errors are similar to the earlier ones where more than one session shared a machine and stepped on each other's toes, but in this case I can't find any nodes belonging to multiple sessions on my side.
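As a client-side sanity check for the "one machine, multiple sessions" symptom, the session listing can be grepped for repeated hostnames. A hedged sketch (assumes `duffy client --format=flat list-sessions` output with `active=` and `hostname=` fields, as in the listing further down):

```shell
# Print any hostname claimed by more than one active session; reads the
# flat session listing on stdin.
find_shared_hosts() {
    grep 'active=TRUE' | grep -o "hostname='[^']*'" | sort | uniq -d
}

# Usage:
# duffy client --format=flat list-sessions | find_shared_hosts
```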
What I can see is one session stuck in a "limbo":
(duffy) [systemd@slave01 ~]$ duffy client --format=flat list-sessions | sort
session_id=3864 active=TRUE created_at='2022-08-10 05:46:01.445507+00:00' retired_at= pool= hostname='n27-44-255.pool.ci.centos.org' ipaddr='172.27.44.255'
(the pool is missing and the machine doesn't belong to any job).
Also, the frequent API errors are a bit suspicious:
2022-08-10 07:49:24,844 [agent-control/allocate_node] ERROR: [Try 12/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
but that might be expected? I don't know how big the EC2 metal Duffy pool is.
FTR, here are all sessions/machines which failed with one of the above errors:
2022-08-09 22:44:17,880 [agent-control/allocate_node] INFO: Allocated node n27-28-26.pool.ci.centos.org with session id 3406
2022-08-09 23:07:20,665 [agent-control/allocate_node] INFO: Allocated node n27-27-1.pool.ci.centos.org with session id 3681
2022-08-09 22:47:20,431 [agent-control/allocate_node] INFO: Allocated node n27-24-218.pool.ci.centos.org with session id 3441
2022-08-09 22:52:18,472 [agent-control/allocate_node] INFO: Allocated node n27-12-14.pool.ci.centos.org with session id 3510
2022-08-10 00:51:48,245 [agent-control/allocate_node] INFO: Allocated node n27-46-161.pool.ci.centos.org with session id 3770
2022-08-10 05:29:01,853 [agent-control/allocate_node] INFO: Allocated node n27-2-195.pool.ci.centos.org with session id 3863
2022-08-10 07:49:57,979 [agent-control/allocate_node] INFO: Allocated node n27-12-0.pool.ci.centos.org with session id 3980
2022-08-10 07:37:08,962 [agent-control/allocate_node] INFO: Allocated node n27-8-206.pool.ci.centos.org with session id 3958
2022-08-10 07:54:41,722 [agent-control/allocate_node] INFO: Allocated node n27-29-29.pool.ci.centos.org with session id 4011
2022-08-10 08:01:39,165 [agent-control/allocate_node] INFO: Allocated node n27-12-134.pool.ci.centos.org with session id 4020
2022-08-10 08:49:47,876 [agent-control/allocate_node] INFO: Allocated node n27-15-78.pool.ci.centos.org with session id 4218
I wonder if our workload is racing against the cloud-init stuff that runs right after the machine is provisioned, but that should, hopefully, be guarded by the dnf locks...
The machines aren't provisioned when you request them but ahead of time, so Duffy can hand them out quickly; only the tenant SSH keys are installed at the moment a machine is allocated to your session.
No idea what's happening there. @arrfab @phsmoura, any idea?
For EC2 bare-metal it's really expensive, so we only provision one node at a time:
duffy client show-pool metal-ec2-c5n-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 1,
    "levels": {
      "provisioning": 0,
      "ready": 1,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}
It's true that cloud-init does a full update as the very first thing, so if you request a node while it's still updating in the background, that could explain this. But that would mean dnf doesn't work as it should. We can disable that cloud-init dnf update, but that means sometimes handing over a (very) old instance (images aren't refreshed on a regular basis).
From our ansible setup :
#cloud-config
disable_root: False
runcmd:
  - hostnamectl set-hostname $(hostname|sed 's/ip-172-/n/')
  - dnf update -y
Is there an easy (reliable) way to check if cloud-init is finished which tenants could check?
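One candidate (assuming the images ship a recent enough cloud-init, which I haven't verified): `cloud-init status --wait`, which blocks until cloud-init reports completion. A guarded sketch:

```shell
# Block until cloud-init is done; no-op on hosts without cloud-init.
wait_for_cloud_init() {
    if command -v cloud-init >/dev/null 2>&1; then
        cloud-init status --wait
    else
        echo "cloud-init not present"
    fi
}
```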
It looks like our job downloads the packages before cloud-init's update starts; then cloud-init's update takes the dnf lock, performs the upgrade (which includes packages our job just downloaded), cleans up the downloaded packages, and releases the lock. Our job then fails, since the packages it just downloaded are no longer there. That's unfortunate (and should be handled by dnf, imo).
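One possible mitigation (an assumption on my part, not something I've verified against this exact race): dnf's `keepcache` option keeps downloaded RPMs around even after successful transactions, so a concurrent update's cleanup shouldn't remove them. A sketch, with the config path parameterized purely for illustration:

```shell
# Enable dnf's keepcache option in the given config file (defaults to the
# real path, /etc/dnf/dnf.conf); idempotent if the option is already set.
enable_keepcache() {
    local conf=${1:-/etc/dnf/dnf.conf}
    if ! grep -q '^keepcache=' "$conf" 2>/dev/null; then
        echo 'keepcache=True' >> "$conf"
    fi
}
```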
Disabling the cloud-init update would make sense (we do the system update ourselves anyway), but I'm not sure if it would cause issues for other projects.
Theoretically, something like
systemd-run --wait -p Wants=cloud-init.target -p After=cloud-init.target true
should do the trick (maybe there's a nicer solution). I'll try this first, so we can keep cloud-init intact.
So far it seems to be indeed doing the right thing, so I'll push it to production and let it bake for a couple of days.
Oh well, never mind. That seemed to work for the first hour, but then most jobs either started timing out:
perl-parent-1:0.237-1.el8.noarch perl-podlators-4.11-1.el8.noarch perl-threads-1:2.21-2.el8.x86_64 perl-threads-shared-1.58-2.el8.x86_64 Complete! Cloning into 'systemd-centos-ci'... 2022-08-11 18:23:24,842 [agent-control/main] INFO: PHASE 2: Run tests in Vagrant VMs 2022-08-11 18:23:24,843 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-20-126.pool.ci.centos.org': systemd-centos-ci/vagrant/vagrant-ci-wrapper.sh -d 'arch' -r 'pr:24263' 2022-08-11 18:23:24,843 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n27-20-126.pool.ci.centos.org systemd-centos-ci/vagrant/vagrant-ci-wrapper.sh -d 'arch' -r 'pr:24263' Pseudo-terminal will not be allocated because stdin is not a terminal. Warning: Permanently added 'n27-20-126.pool.ci.centos.org,172.27.20.126' (ECDSA) to the list of known hosts. Timeout, server n27-20-126.pool.ci.centos.org not responding.
or getting timeouts from the API:
2022-08-11 18:43:10,160 [agent-control/allocate_node] ERROR: [Try 48/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
2022-08-11 18:43:36,053 [agent-control/main] ERROR: Execution failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/usr/local/lib/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/usr/lib64/python3.8/ssl.py", line 1226, in recv
    return self.read(buflen)
  File "/usr/lib64/python3.8/ssl.py", line 1101, in read
    return self._sslobj.read(len)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
    raise exc
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
    response = connection.handle_request(request)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/connection.py", line 90, in handle_request
    return self._connection.handle_request(request)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/http11.py", line 105, in handle_request
    raise exc
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/http11.py", line 84, in handle_request
    ) = self._receive_response_headers(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/http11.py", line 148, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_sync/http11.py", line 177, in _receive_event
    data = self._network_stream.read(
  File "/usr/local/lib/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc)
httpcore.ReadTimeout: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./agent-control.py", line 330, in main
    ac.allocate_node(args.pool)
  File "./agent-control.py", line 69, in allocate_node
    result = self._client.request_session([payload])
  File "/usr/local/lib/python3.8/site-packages/duffy/client/main.py", line 123, in request_session
    return self._query_method(
  File "/usr/local/lib/python3.8/site-packages/duffy/client/main.py", line 94, in _query_method
    response = client_method(url=url, **add_kwargs)
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 1130, in post
    return self.request(
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 815, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 902, in send
    response = self._send_handling_auth(
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 930, in _send_handling_auth
    response = self._send_handling_redirects(
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 967, in _send_handling_redirects
    response = self._send_single_request(request)
  File "/usr/local/lib/python3.8/site-packages/httpx/_client.py", line 1003, in _send_single_request
    response = transport.handle_request(request)
  File "/usr/local/lib/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out
After another round of debugging, it turns out this particular issue is caused by systemd (oh no). Long story short: since the AMIs are quite outdated, the system update performed by cloud-init bumps the systemd package several versions ahead, one of which introduces a backward-incompatible change to systemd user units. In combination with the cloud-init workaround above, this causes an error when the ssh session ends, which triggers a session restart just as we reconnect over ssh for the second time, killing the ssh connection and causing the test to fail:
[  379.336057] n27-16-16.pool.ci.centos.org sshd[5604]: Disconnected from user root 172.19.0.16 port 58460
[  379.338082] n27-16-16.pool.ci.centos.org sshd[5463]: pam_unix(sshd:session): session closed for user root
[  379.340102] n27-16-16.pool.ci.centos.org systemd-logind[1923]: Session 5 logged out. Waiting for processes to exit.
[  379.340327] n27-16-16.pool.ci.centos.org systemd[1]: session-5.scope: Succeeded.
[  379.341208] n27-16-16.pool.ci.centos.org systemd-logind[1923]: Removed session 5.
[  379.342297] n27-16-16.pool.ci.centos.org systemd[1]: Stopping User Manager for UID 0...
[  379.343138] n27-16-16.pool.ci.centos.org systemd[5562]: /usr/lib/systemd/user/systemd-exit.service:16: Failed to parse failure action specifier, ignoring: exit-force
[  379.343488] n27-16-16.pool.ci.centos.org systemd[5562]: systemd-exit.service: Service lacks both ExecStart= and ExecStop= setting. Refusing.
[  379.343525] n27-16-16.pool.ci.centos.org systemd[5562]: Failed to enqueue exit.target job: Unit systemd-exit.service has a bad unit file setting.
[  379.941173] n27-16-16.pool.ci.centos.org sshd[63794]: Accepted publickey for root from 172.19.0.16 port 58582 ssh2: RSA SHA256:XI6P95wU3TyZqN8YTWLscn48i9t+dMg1xgN9BgfCmDo
[  379.947015] n27-16-16.pool.ci.centos.org systemd-logind[1923]: New session 7 of user root.
[  379.947295] n27-16-16.pool.ci.centos.org sshd[63794]: pam_systemd(sshd:session): Failed to create session: Start job for unit user-0.slice failed with 'canceled'
This is a long-standing issue (it's still marked as FIXME even in the Fedora systemd specfile: https://src.fedoraproject.org/rpms/systemd/blob/rawhide/f/systemd.spec#_797). The fix should, hopefully, be fairly trivial: extend the previous workaround a bit:
systemd-run --wait -t -p Wants=cloud-init.target -p After=cloud-init.target true && systemctl --user daemon-reexec
I'll give it a try (and hopefully it's the last issue for now).
I don't want to jinx it, but with:
systemd-run --wait -p Wants=cloud-init.target -p After=cloud-init.target true && systemctl --user daemon-reexec
all involved parties are finally content and the jobs don't fail anymore.
The workaround is a bit ugly, since C8S is missing some upstream systemd patches that would allow it to be a single command, e.g.:
systemd-run --wait -p Wants=cloud-init.target -p After=cloud-init.target systemctl -M root@.host --user daemon-reexec
Another option would be:
systemd-run --wait -E DBUS_SESSION_BUS_ADDRESS=$DBUS_SESSION_BUS_ADDRESS -p Wants=cloud-init.target -p After=cloud-init.target systemctl --user daemon-reexec
but SELinux complains about this one, so it's a no-go again.
So, after moving all upstream jobs over to the metal machines, I noticed that Duffy provisions only one metal machine at a time and returns can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64' for all other API requests in the meantime. I know (after speaking with @arrfab) that only one machine is "pre-provisioned" in the metal pools, since the machines are expensive (which is totally okay). However, with multiple requests for a metal machine (even from a single project), it can take a long time of going back and forth with Duffy before a request is actually accepted, given the one-machine-at-a-time provisioning.
For example, after spawning ten jobs (the entire project quota ATM), all for the metal machines, it took the last job ~30 minutes before Duffy got through all the other provisions and accepted the request:
00:01:21.556 2022-08-18 17:59:11,243 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-ec2-c5n-centos-8s-x86_64
00:02:24.819 2022-08-18 18:00:14,505 [agent-control/allocate_node] ERROR: [Try 12/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:03:42.296 2022-08-18 18:01:31,980 [agent-control/allocate_node] ERROR: [Try 24/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:04:53.520 2022-08-18 18:02:43,205 [agent-control/allocate_node] ERROR: [Try 36/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:06:07.031 2022-08-18 18:03:56,715 [agent-control/allocate_node] ERROR: [Try 48/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:07:21.300 2022-08-18 18:05:10,985 [agent-control/allocate_node] ERROR: [Try 60/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:08:30.040 2022-08-18 18:06:19,726 [agent-control/allocate_node] ERROR: [Try 72/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:09:39.719 2022-08-18 18:07:29,404 [agent-control/allocate_node] ERROR: [Try 84/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:10:52.893 2022-08-18 18:08:42,579 [agent-control/allocate_node] ERROR: [Try 96/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:12:10.151 2022-08-18 18:09:59,837 [agent-control/allocate_node] ERROR: [Try 108/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:13:26.832 2022-08-18 18:11:16,517 [agent-control/allocate_node] ERROR: [Try 120/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:14:43.728 2022-08-18 18:12:33,413 [agent-control/allocate_node] ERROR: [Try 132/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:15:59.154 2022-08-18 18:13:48,778 [agent-control/allocate_node] ERROR: [Try 144/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:17:10.876 2022-08-18 18:15:00,562 [agent-control/allocate_node] ERROR: [Try 156/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:18:17.324 2022-08-18 18:16:07,009 [agent-control/allocate_node] ERROR: [Try 168/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:19:23.845 2022-08-18 18:17:13,530 [agent-control/allocate_node] ERROR: [Try 180/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:20:34.989 2022-08-18 18:18:24,676 [agent-control/allocate_node] ERROR: [Try 192/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:21:42.599 2022-08-18 18:19:32,284 [agent-control/allocate_node] ERROR: [Try 204/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:22:49.741 2022-08-18 18:20:39,427 [agent-control/allocate_node] ERROR: [Try 216/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:23:56.153 2022-08-18 18:21:45,839 [agent-control/allocate_node] ERROR: [Try 228/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:25:05.723 2022-08-18 18:22:55,409 [agent-control/allocate_node] ERROR: [Try 240/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:26:16.080 2022-08-18 18:24:05,766 [agent-control/allocate_node] ERROR: [Try 252/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:27:25.813 2022-08-18 18:25:15,498 [agent-control/allocate_node] ERROR: [Try 264/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:28:36.028 2022-08-18 18:26:25,713 [agent-control/allocate_node] ERROR: [Try 276/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:29:44.501 2022-08-18 18:27:34,186 [agent-control/allocate_node] ERROR: [Try 288/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:30:51.414 2022-08-18 18:28:41,100 [agent-control/allocate_node] ERROR: [Try 300/1440] Received an API error from the server: detail="can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'"
00:31:48.194 2022-08-18 18:29:37,880 [agent-control/allocate_node] INFO: Allocated node n27-36-189.pool.ci.centos.org with session id 13530
Is this expected? I'm afraid the delay will get even worse once other projects migrate over to the metal machines, since currently it looks like only systemd is utilizing the pool (at least judging by the output of duffy client show-pool metal-ec2-c5n-centos-8s-x86_64).
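For reference, the retry loop on our side behaves roughly like the following. This is a sketch, not the actual agent-control code; the `allocate_node` signature and the delay values are assumptions (only the 1440-try limit matches the `[Try N/1440]` counter in the log above):

```python
import random

def allocate_node(request_session, pool, max_tries=1440, sleep=lambda s: None):
    """Poll Duffy until a session is granted or the tries run out.

    `request_session` is any callable that returns a session dict on success
    or raises RuntimeError on "can't reserve nodes"; `sleep` is injectable
    so the loop can be tested without actually waiting.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return request_session(pool)
        except RuntimeError:
            # jitter the delay so ten parallel jobs don't retry in lockstep
            sleep(5 + random.uniform(0, 2))
    raise TimeoutError(f"no node from pool {pool!r} after {max_tries} tries")
```

The jitter matters here: with the whole project quota hammering one pool, synchronized retries just make the back-and-forth with Duffy worse.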
On an unrelated note - I noticed another session stuck in a "limbo" - the machine is no longer accessible, the pool field is empty, but the session still exists:
(duffy) [systemd@slave01 ~]$ duffy client show-session 16583
{
  "action": "get",
  "session": {
    "retired_at": null,
    "active": true,
    "created_at": "2022-08-19T15:00:36.042719+00:00",
    "id": 16583,
    "expires_at": "2022-08-19T21:00:36.871262+00:00",
    "tenant": {
      "retired_at": null,
      "active": true,
      "created_at": "2022-06-23T13:25:27.880793+00:00",
      "name": "systemd",
      "is_admin": false,
      "ssh_key": "**********",
      "node_quota": null,
      "session_lifetime": null,
      "session_lifetime_max": null,
      "id": 5,
      "effective_node_quota": 10,
      "effective_session_lifetime": 21600.0,
      "effective_session_lifetime_max": 43200.0
    },
    "data": {
      "nodes_specs": [
        {
          "quantity": 1,
          "pool": "metal-ec2-c5n-centos-8s-x86_64"
        }
      ]
    },
    "nodes": [
      {
        "hostname": "n27-25-58.pool.ci.centos.org",
        "ipaddr": "172.27.25.58",
        "comment": null,
        "pool": null,
        "reusable": false,
        "data": {
          "provision": {
            "ec2_instance_id": "i-0b991b4609f71c7df",
            "ec2_instance_type": "c5n.metal",
            "hostname": "n27-25-58.pool.ci.centos.org",
            "ipaddr": "172.27.25.58",
            "public_hostname": "ec2-54-81-45-130.compute-1.amazonaws.com",
            "public_ipaddress": "54.81.45.130"
          },
          "nodes_spec": {
            "quantity": 1,
            "pool": "metal-ec2-c5n-centos-8s-x86_64"
          }
        },
        "id": 1448,
        "state": "done"
      }
    ]
  }
}
A quick update: two more machines were found stuck in limbo:
session_id=19618 active=TRUE created_at='2022-08-20 16:40:42.587697+00:00' retired_at= pool= hostname='n27-9-18.pool.ci.centos.org' ipaddr='172.27.9.18'
session_id=19658 active=TRUE created_at='2022-08-20 16:42:33.172434+00:00' retired_at= pool= hostname='n27-5-55.pool.ci.centos.org' ipaddr='172.27.5.55'
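Such limbo sessions can be spotted mechanically from the session JSON shown above: they are still active, but their nodes have lost the pool assignment. A small filter along these lines could do it (a sketch; `find_limbo_sessions` is a hypothetical helper, only the field names match the actual Duffy output):

```python
def find_limbo_sessions(sessions):
    """Return ids of sessions that are still active but have at least one
    node whose `pool` field is empty -- the limbo state described above."""
    return [
        sess["id"]
        for sess in sessions
        if sess.get("active")
        and any(node.get("pool") is None for node in sess.get("nodes", []))
    ]
```

Feeding it the output of a list-all-sessions query would flag candidates for manual retirement instead of waiting for someone to stumble over them.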
Also, Duffy has been stuck for two hours provisioning a metal machine (metal-ec2-c5n-centos-8s-x86_64), and the API requests now time out quite frequently; not sure what's going on...
I had a quick look at duffy-tasks (the provisioner) and it's now tracebacking on something else:
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]: kombu.exceptions.OperationalError: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands t>
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]: During handling of the above exception, another exception occurred:
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]: Traceback (most recent call last):
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/beat.py", line 287, in apply_entry
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     result = self.apply_async(entry, producer=producer, advance=False)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/beat.py", line 412, in apply_async
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     reraise(SchedulingError, SchedulingError(
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/exceptions.py", line 108, in reraise
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     raise value.with_traceback(tb)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/beat.py", line 404, in apply_async
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     return task.apply_async(entry_args, entry_kwargs,
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/app/task.py", line 575, in apply_async
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     return app.send_task(
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/app/base.py", line 788, in send_task
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     amqp.send_task_message(P, name, message, **options)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/celery/app/amqp.py", line 510, in send_task_message
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     ret = producer.publish(
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/kombu/messaging.py", line 177, in publish
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     return _publish(
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/kombu/connection.py", line 556, in _ensured
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     errback and errback(exc, 0)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     self.gen.throw(type, value, traceback)
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/kombu/connection.py", line 450, in _reraise_as_library_errors
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]:     raise ConnectionError(str(exc)) from exc
Aug 21 06:02:25 duffy.ci.centos.org duffy[2515230]: celery.beat.SchedulingError: Couldn't apply scheduled task duffy.tasks.provision.fill_pools(): MISCONF Redis is configured to save RDB snapshots
@nphilipp ^
Also: I guess it would be better to report all Duffy issues on its dedicated tracker (https://github.com/CentOS/duffy/issues), or should we just continue here?
Just a quick update: that one is again caused by this reported issue. I see it was closed by @nphilipp upstream, but it's not yet in a release, so not yet deployed on duffy.ci.centos.org.
Thanks a lot for today's quick fix!
That makes sense, I filed the machine-stuck-in-a-limbo issue as https://github.com/CentOS/duffy/issues/558.
I've just released 3.3.4 with the accrued fixes and updated Duffy on n23:
[nphilipp@n23 ~]$ duffy -c .config/duffy client show-pool virt-one-small-centos-9s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-one-small-centos-9s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 0,
      "ready": 3,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}
[nphilipp@n23 ~]$ for i in 1 2; do duffy -c ~/.config/duffy/ client request-session pool=virt-one-small-centos-9s-x86_64,quantity=1 > duffy-$i.out & done; wait
[1] 3938193
[2] 3938194
[1]-  Done    duffy -c ~/.config/duffy/ client request-session pool=virt-one-small-centos-9s-x86_64,quantity=1 > duffy-$i.out
[2]+  Done    duffy -c ~/.config/duffy/ client request-session pool=virt-one-small-centos-9s-x86_64,quantity=1 > duffy-$i.out
[nphilipp@n23 ~]$ diff -u duffy-{1,2}.out
--- duffy-1.out	2022-08-22 08:30:08.529854474 +0000
+++ duffy-2.out	2022-08-22 08:30:07.485906704 +0000
@@ -3,9 +3,9 @@
   "session": {
     "retired_at": null,
     "active": true,
-    "created_at": "2022-08-22T08:30:07.870737+00:00",
-    "id": 62,
-    "expires_at": "2022-08-22T14:30:07.871686+00:00",
+    "created_at": "2022-08-22T08:30:06.249896+00:00",
+    "id": 60,
+    "expires_at": "2022-08-22T14:30:06.542881+00:00",
     "tenant": {
       "retired_at": null,
       "active": true,
@@ -31,24 +31,24 @@
     },
     "nodes": [
       {
-        "hostname": "n50-74.cloud.ci.centos.org",
-        "ipaddr": "10.234.50.74",
+        "hostname": "n50-73.cloud.ci.centos.org",
+        "ipaddr": "10.234.50.73",
         "comment": null,
         "pool": "virt-one-small-centos-9s-x86_64",
         "reusable": false,
         "data": {
           "provision": {
             "opennebula": {
-              "id": 169
+              "id": 168
             },
-            "ipaddr": "10.234.50.74"
+            "ipaddr": "10.234.50.73"
           },
           "nodes_spec": {
             "quantity": 1,
             "pool": "virt-one-small-centos-9s-x86_64"
           }
         },
-        "id": 383,
+        "id": 382,
         "state": "deployed"
       }
     ]
[nphilipp@n23 ~]$
@arrfab would you please install the update in prod? Also, users of duffy client experiencing this issue should update to 3.3.4, too, because previous versions might run into timeouts on the 2nd (3rd, ...) parallel request if the API retries the DB transaction – the httpx package used has a default request timeout of 5 seconds which might be too short for this case, 3.3.4 sets the timeout to infinite.
Pushed in git/inventory and applied by ansible, so duffy.ci.centos.org is now running 3.3.4. In parallel, the cico-workspace container was rebuilt on quay.io, so it will also be used for the next pods/jobs on the jenkins instances (to reflect that specific duffy[client] version).
@mrc0mmand: do you see any improvement in the last ~24h? We'd like to hear some feedback from you and (eventually) close this ticket if it's now resolved :)
Everything is looking good from my side - the API appears to be stable and I haven't seen any stuck sessions. I'll go ahead and close this ticket, since it's kinda long already and all relevant issues have been either resolved or reported to the GH issue tracker.
Thanks a bunch to everyone involved!
Metadata Update from @mrc0mmand: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)