#558 Incorrectly tagged machines in the Duffy pool
Closed: Fixed with Explanation 2 years ago by arrfab. Opened 2 years ago by mrc0mmand.

Today I noticed a couple of failed runs in jobs which requested CentOS 8 Stream machines (x86_64, baremetal), all of them complaining about missing dnf. Looks like Duffy is either tagging C7 machines as C8S, or the machines are really messed up:

2021-12-12 16:49:59,048 [agent-control/allocate_node] INFO: Attempting to allocate a node (version: 8-stream, arch: x86_64, flavor: n/a)
2021-12-12 16:49:59,048 [agent-control/_execute_api_command] INFO: Duffy request URL: http://admin.ci.centos.org:8080/Node/get
2021-12-12 16:49:59,054 [connectionpool/_new_conn] INFO: Starting new HTTP connection (1): admin.ci.centos.org
2021-12-12 16:50:00,333 [agent-control/allocate_node] INFO: Successfully allocated node 'n15.pufty' (1a79b802)
2021-12-12 16:50:00,333 [agent-control/main] INFO: PHASE 1: Setting up basic dependencies to configure CI repository
2021-12-12 16:50:00,334 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n15.pufty': dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
2021-12-12 16:50:00,334 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ssh -t -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=180 -o TCPKeepAlive=yes -o ServerAliveInterval=2 -l root n15.pufty dnf clean all && dnf makecache && dnf -y install bash git rsync && rm -fr systemd-centos-ci && git clone https://github.com/systemd/systemd-centos-ci
Warning: Permanently added 'n15.pufty,172.19.3.79' (ECDSA) to the list of known hosts.
bash: dnf: command not found
Connection to n15.pufty closed.
2021-12-12 16:50:01,331 [agent-control/main] ERROR: Execution failed
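
A client-side guard could turn this failure into an explicit "wrong OS" error before any `dnf` command runs. The following is a minimal sketch, not part of agent-control: the helper names `parse_os_release` and `node_matches_major` are hypothetical. It assumes the contents of `/etc/os-release` have already been fetched from the node (for example over the same ssh connection), and that a requested version of `8-stream` corresponds to a `VERSION_ID` major of `8`.

```python
def parse_os_release(text: str) -> dict:
    """Parse os-release style KEY=value lines into a dict (quotes stripped)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def node_matches_major(os_release_text: str, requested_major: str) -> bool:
    """Return True if the major part of VERSION_ID equals requested_major."""
    version_id = parse_os_release(os_release_text).get("VERSION_ID", "")
    return version_id.split(".")[0] == requested_major


# Abbreviated sample inputs, not captured from the affected nodes.
c7_sample = 'NAME="CentOS Linux"\nID="centos"\nVERSION_ID="7"\n'
c8s_sample = 'NAME="CentOS Stream"\nID="centos"\nVERSION_ID="8"\n'

if not node_matches_major(c7_sample, "8"):
    print("allocated node does not match the requested version")
```

With a check like this, a C7 machine handed out as C8S would be rejected right after allocation instead of failing later with `dnf: command not found`.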

Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, high-gain, medium-trouble

2 years ago

Metadata Update from @arrfab:
- Issue assigned to arrfab

2 years ago

Based on the discussion on IRC, this appears to be mostly limited to the "pufty" chassis, where we have already had issues in the past.
Ansible uses the uri module to interact with the SeaMicro chassis, and it seems the chassis is answering '200' but not actually doing anything (to be confirmed whether that's the "usual" issue we had in the past).
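
One way to guard against a management API that answers '200' without acting is to verify the effect rather than trust the status code. The sketch below illustrates that pattern only; it is not the actual Ansible playbook or the SeaMicro API. The function `request_and_verify` and its callables `send_request` and `read_state` are hypothetical stand-ins for "issue the chassis request" and "query the node's resulting state".

```python
import time
from typing import Callable


def request_and_verify(
    send_request: Callable[[], int],
    read_state: Callable[[], str],
    expected_state: str,
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send a request, then poll until the expected state is observed.

    Returns False if the status is not 200, or if the state never
    reaches expected_state within the given number of attempts.
    """
    status = send_request()
    if status != 200:
        return False
    for _ in range(attempts):
        if read_state() == expected_state:
            return True
        sleep(delay)
    # The endpoint said 200, but nothing observable changed.
    return False


# Simulated chassis that acknowledges the request but never acts on it.
ok = request_and_verify(
    send_request=lambda: 200,
    read_state=lambda: "off",
    expected_state="on",
    attempts=3,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print("request took effect:", ok)
```

In a playbook, the same idea would mean following the uri call with a separate check of the node's state instead of treating the HTTP status as success.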

I've marked all nodes in that chassis as "Investigating", meaning that Duffy shouldn't even try to hand them out. I'll wait for the other workloads to finish (I saw some "Deployed" nodes with CentOS 7) and then reset the whole management interface. Once that is working (checked with a manual ansible-playbook call to reinstall a node), I'll mark all nodes as "Ready", meaning that they'll then be reinstalled by the provisioner.

Waiting on feedback from @mrc0mmand to see if that's working now at least

Just a quick update: since I put all these nodes into "Investigating", I have proceeded with a reset of the pufty SeaMicro chassis. I'll test some deployments on that chassis once it is back online, and then put it back into the Ready pool after confirmation.

The API management controller of that SeaMicro chassis is back, and I kicked off Ansible to deploy 8-stream as a test on one node; it worked fine.
I have now put the nodes back as 'Active', and I can see that Duffy has started provisioning the instances needed to have them 'Ready' in the pool.

Closing for now

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

2 years ago


Metadata
Boards 1
CentOS CI Infra Status: Backlog