Hey!
Yesterday I noticed that our CentOS CI jobs can't get any nodes. Today the situation is the same, and looking at the output of duffy client show-pool it looks like no nodes are reaching the "ready" state; instead they seem to be stuck in the provisioning/contextualizing states:
$ duffy client show-pool virt-ec2-t2-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 10,
      "ready": 0,
      "contextualizing": 17,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

$ duffy client show-pool virt-ec2-t2-centos-9s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-9s-x86_64",
    "fill_level": 8,
    "levels": {
      "provisioning": 8,
      "ready": 0,
      "contextualizing": 6,
      "deployed": 2,
      "deprovisioning": 0
    }
  }
}

$ duffy client show-pool metal-ec2-c5n-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 5,
    "levels": {
      "provisioning": 5,
      "ready": 0,
      "contextualizing": 7,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}
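For anyone monitoring pools the same way, here is a small sketch of how the JSON above could be checked programmatically (the sample payload is copied from the output above; the helper name is ours, not part of duffy):

```python
import json

def stuck_pools(payloads):
    """Return names of pools that expect nodes (fill_level > 0) but have
    zero "ready" nodes, i.e. provisioning/contextualizing is not draining."""
    stuck = []
    for raw in payloads:
        pool = json.loads(raw)["pool"]
        if pool["fill_level"] > 0 and pool["levels"]["ready"] == 0:
            stuck.append(pool["name"])
    return stuck

# Sample payload taken from the ticket output above.
sample = """{"action": "get", "pool": {"name": "virt-ec2-t2-centos-8s-x86_64",
  "fill_level": 10, "levels": {"provisioning": 10, "ready": 0,
  "contextualizing": 17, "deployed": 0, "deprovisioning": 0}}}"""

print(stuck_pools([sample]))  # → ['virt-ec2-t2-centos-8s-x86_64']
```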
Any idea what's going on?
Thanks!
Metadata Update from @arrfab: - Issue assigned to arrfab
Metadata Update from @arrfab: - Issue priority set to: 🔥 Urgent 🔥 (was: Needs Review) - Issue tagged with: centos-ci-infra, high-gain, high-trouble
We got confirmation from other tenants in the #centos-ci irc channel on libera.chat, and I just had a quick look this morning: it seems the duffy-tasks app was stuck and so was unable to contextualize or even deploy new instances. I'll check with @nphilipp when he's available online, but in the meantime I just decided to :
provisioning
It was then able to reprovision some ec2 nodes, and some other tenants confirmed on irc that it was then working for them.
On my side:
$ duffy client show-pool virt-ec2-t2-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 0,
      "ready": 10,
      "contextualizing": 17,
      "deployed": 2,
      "deprovisioning": 0
    }
  }
}
So while it now seems "fixed", I'll check with Nils, as we still have plenty of nodes in the contextualizing state, and see what to do about those.
My first idea would be (and I'd announce it, as it will again mean a small downtime) :
We investigated with @nphilipp what the issue could have been, as the duffy worker was stuck with a traceback:
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     return self.mechanism.provision(nodes=nodes)
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/duffy/nodes/mechanisms/ansible.py", line 92, in provision
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     return self.run_playbook(
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/duffy/nodes/mechanisms/ansible.py", line 44, in run_playbook
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     with TemporaryDirectory() as tmpdir:
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:   File "/usr/lib64/python3.8/tempfile.py", line 780, in __init__
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     self.name = mkdtemp(suffix, prefix, dir)
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:   File "/usr/lib64/python3.8/tempfile.py", line 358, in mkdtemp
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     _os.mkdir(file, 0o700)
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmplga68zf0'
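That FileNotFoundError is easy to reproduce whenever tempfile cannot create directories under its target path, e.g. when the parent directory is missing (the path below is a made-up stand-in for the broken /tmp):

```python
import tempfile

# Point tempfile at a parent directory that doesn't exist; mkdtemp()
# then fails the same way the duffy worker did when /tmp was unusable.
try:
    tempfile.TemporaryDirectory(dir="/nonexistent-parent-dir")
except FileNotFoundError as exc:
    print(f"caught: {exc.strerror} (errno {exc.errno})")
```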
It seems the VM itself suddenly had issues (for some hours) writing data to its own disks/filesystem (confirmed by other OS commands). The only thing we found was a raid-check running on the underlying hypervisor, which could have impacted I/O from the VM to the physical disks (software raid1):
[Sun Jun 25 01:59:35 2023] md: data-check of RAID array md126
[Sun Jun 25 01:59:41 2023] md: delaying data-check of md127 until md126 has finished (they share one or more physical units)
[Sun Jun 25 02:19:05 2023] md: md126: data-check done.
[Sun Jun 25 02:19:05 2023] md: data-check of RAID array md127
[Sun Jun 25 02:19:11 2023] md: md127: data-check done.
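From those timestamps the md126 data-check window can be read off directly; a small sketch for computing it (log lines copied from the dmesg output above):

```python
from datetime import datetime

def parse_ts(line):
    """Extract the wall-clock timestamp from a '[Sun Jun 25 01:59:35 2023] ...' line."""
    stamp = line[line.index("[") + 1 : line.index("]")]
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")

start = parse_ts("[Sun Jun 25 01:59:35 2023] md: data-check of RAID array md126")
end = parse_ts("[Sun Jun 25 02:19:05 2023] md: md126: data-check done.")
print(end - start)  # → 0:19:30
```

So the md126 check ran for roughly twenty minutes, which lines up with the window in which the VM saw degraded I/O.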
Even if we fixed the actual issue today, we are now in a state where the Duffy DB doesn't match the AWS state, with some "left behind" ec2 instances. So we have just announced a small downtime (see https://lists.centos.org/pipermail/ci-users/2023-June/005956.html) to start from a clean state again.
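Conceptually, spotting those "left behind" instances is a set difference between what the Duffy DB tracks and what the cloud reports; a minimal sketch with made-up hostnames (Duffy's actual reconciliation, if any, may work differently):

```python
def orphaned_instances(db_hostnames, cloud_hostnames):
    """Instances the cloud still runs but the Duffy DB no longer tracks."""
    return sorted(set(cloud_hostnames) - set(db_hostnames))

# Hypothetical data, purely for illustration.
db = ["n1.ci.centos.org", "n2.ci.centos.org"]
cloud = ["n1.ci.centos.org", "n2.ci.centos.org", "n3.ci.centos.org"]
print(orphaned_instances(db, cloud))  # → ['n3.ci.centos.org']
```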
Closing, as we proceeded with the maintenance window and also updated duffy to version 3.3.7.
Metadata Update from @arrfab: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)