Today I noticed a couple of failed runs in jobs that requested CentOS 8 Stream machines (x86_64, bare metal), all of them complaining about a missing dnf. It looks like Duffy is either tagging C7 machines as C8S, or the machines are really messed up:
```
2021-12-12 16:49:59,048 [agent-control/allocate_node] INFO: Attempting to allocate a node (version: 8-stream, arch: x86_64, flavor: n/a)
2021-12-12 16:49:59,048 [agent-control/_execute_api_command] INFO: Duffy request URL: http://admin.ci.centos.org:8080/Node/get
2021-12-12 16:49:59,054 [connectionpool/_new_conn] INFO: Starting new HTTP connection (1): admin.ci.centos.org
2021-12-12 16:50:00,333 [agent-control/allocate_node] INFO: Successfully allocated node 'n15.pufty' (1a79b802)
2021-12-12 16:50:00,333 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2021-12-12 16:50:00,334 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n15.pufty': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2021-12-12 16:50:00,334 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n15.pufty dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Warning: Permanently added 'n15.pufty,172.19.3.79' (ECDSA) to the list of known hosts.
bash: dnf: command not found
Connection to n15.pufty closed.
2021-12-12 16:50:01,331 [agent-control/main] ERROR: Execution failed
```
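As a side note, a cheap pre-flight check on the allocated node could catch this mismatch before the first dnf call blows up. A minimal sketch, assuming we compare VERSION_ID from os-release against the requested version (the function name and this guard are illustrative, not part of agent-control):

```shell
#!/bin/sh
# Hypothetical pre-flight check: confirm the node Duffy handed over actually
# runs the requested OS major version (CentOS 7 ships VERSION_ID="7",
# 8 Stream ships VERSION_ID="8").
check_node_os() {
    expected="$1"    # e.g. "8" for a requested 8-stream node
    osrelease="$2"   # normally /etc/os-release on the allocated node
    # Source the file in a subshell and compare its VERSION_ID
    got=$(. "$osrelease" && printf '%s' "$VERSION_ID")
    [ "$got" = "$expected" ]
}

# Usage (on the remote node, before PHASE 1):
#   check_node_os 8 /etc/os-release || { echo "wrong OS deployed" >&2; exit 1; }
```

This fails fast with a clear error instead of the confusing "bash: dnf: command not found" seen above.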
Metadata Update from @arrfab: - Issue tagged with: centos-ci-infra, high-gain, medium-trouble
Metadata Update from @arrfab: - Issue assigned to arrfab
Based on discussion on IRC, it appears to be mostly limited to the "pufty" chassis, with which we already had issues in the past. While ansible uses the uri module to interact with the seamicro chassis, the chassis seems to answer '200' without actually doing anything (to be confirmed whether that's the "usual" issue we've had before).
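For this kind of "HTTP 200 but nothing happened" behaviour, the usual client-side workaround is to poll for the actual state change instead of trusting the status code. A minimal sketch, assuming the management API exposes some state query (the function names and the state-query command are illustrative, not the real seamicro interface):

```shell
#!/bin/sh
# Hypothetical helper: after an API call returns 200, keep querying the
# reported state until it matches what we expect, or give up.
verify_effect() {
    expected="$1"     # state we expect the chassis to end up in
    retries="$2"      # how many times to poll
    delay="$3"        # seconds between polls
    shift 3           # remaining args: command that prints the current state
    i=0
    while [ "$i" -lt "$retries" ]; do
        [ "$("$@")" = "$expected" ] && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Usage: after POSTing e.g. a power-cycle request that answered 200:
#   verify_effect off 10 5 query_power_state n15.pufty || echo "chassis ignored the request"
```

Whether the seamicro API actually offers a usable state query for this is an open question; the point is only that the '200' alone is not a reliable signal here.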
I've marked all nodes in that chassis as "Investigating", meaning that Duffy won't even try to hand them out. I'll wait for the other workloads to finish (I saw some "Deployed" nodes with CentOS 7) and then reset the whole management interface. Once that's working again (verified with a manual ansible-playbook call to reinstall a node), I'll mark all nodes as "Ready", meaning they'll then be reinstalled by the provisioner.
Waiting on feedback from @mrc0mmand to confirm that it's working now, at least.
Just a quick update: since I had put all these nodes into "Investigating", I proceeded with a reset of the pufty seamicro chassis. I'll test some deployments on that chassis once it's back online, and then put it back into the Ready pool after confirmation.
The API management controller of that seamicro chassis is back, and I kicked off ansible to deploy 8-stream as a test on one node, which worked fine. I have now put the nodes back as 'Active', and I can see that Duffy has started provisioning the needed instances to get them 'Ready' in the pool.
Closing for now
Metadata Update from @arrfab: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)