#1204 Duffy nodes stuck in provisioning/contextualizing loop
Closed: Fixed 2 years ago by arrfab. Opened 2 years ago by mrc0mmand.

Hey!

Yesterday I noticed that our CentOS CI jobs couldn't get any nodes. Today the situation is the same, and looking at the output of duffy client show-pool it looks like no nodes are reaching the "ready" state; instead they seem to be stuck in the provisioning/contextualizing states:

$ duffy client show-pool virt-ec2-t2-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 10,
      "ready": 0,
      "contextualizing": 17,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

$ duffy client show-pool virt-ec2-t2-centos-9s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-9s-x86_64",
    "fill_level": 8,
    "levels": {
      "provisioning": 8,
      "ready": 0,
      "contextualizing": 6,
      "deployed": 2,
      "deprovisioning": 0
    }
  }
}

$ duffy client show-pool metal-ec2-c5n-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 5,
    "levels": {
      "provisioning": 5,
      "ready": 0,
      "contextualizing": 7,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}
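
For context, our jobs grab nodes through the duffy client with something along these lines (example pool, exact syntax from memory), and right now such requests simply never get a node assigned:

$ duffy client request-session pool=virt-ec2-t2-centos-8s-x86_64,quantity=1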

Any idea what's going on?

Thanks!


Metadata Update from @arrfab:
- Issue assigned to arrfab

2 years ago

Metadata Update from @arrfab:
- Issue priority set to: 🔥 Urgent 🔥 (was: Needs Review)
- Issue tagged with: centos-ci-infra, high-gain, high-trouble

2 years ago

We got confirmation from other tenants in the #centos-ci IRC channel on libera.chat, and I just had a quick look this morning: it seems the duffy-tasks app was stuck and so unable to contextualize or even deploy new instances.
I'll check with @nphilipp once he's available online, but in the meantime I decided to:

  • stop duffy
  • clean up the nodes stuck in the "provisioning" state
  • restart duffy

Duffy was then able to reprovision some ec2 nodes, and other tenants confirmed on IRC that it was working for them again.
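
For the record, the quick workaround boiled down to roughly the following (sketch only; service names and DB object names are from memory, so double-check before reusing):

# stop the duffy API and the duffy-tasks app
$ systemctl stop duffy duffy-tasks

# drop the nodes stuck in "provisioning" from the duffy DB so they don't block refilling
# (assuming the PostgreSQL backend and a "nodes" table with a "state" column)
$ sudo -u duffy psql duffy -c "DELETE FROM nodes WHERE state = 'provisioning';"

# start everything again so the pools get refilled
$ systemctl start duffy duffy-tasks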

On my side:

duffy client show-pool virt-ec2-t2-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 0,
      "ready": 10,
      "contextualizing": 17,
      "deployed": 2,
      "deprovisioning": 0
    }
  }
}

So while it now seems "fixed", we still have plenty of nodes stuck in the contextualizing state, and I'll discuss with Nils what to do about those.

My first idea would be the following (to be announced first, as it will again mean a small downtime; roughly sketched below):

  • stop duffy
  • clean all nodes from the duffy DB (to start from a clean state)
  • terminate all ec2 instances (also to start from a clean state)
  • restart duffy so that it provisions the needed pools again
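
Roughly, that plan would look like this (just a sketch; table names, AWS filters and instance ids would need to be checked before actually running anything):

# announce the downtime, then stop duffy and its tasks app
$ systemctl stop duffy duffy-tasks

# wipe the node inventory from the duffy DB (assuming the PostgreSQL backend)
$ sudo -u duffy psql duffy -c "TRUNCATE nodes CASCADE;"

# terminate the left-over ec2 instances on the AWS side
$ aws ec2 terminate-instances --instance-ids <id> [<id> ...]

# start duffy again so it refills the configured pools to their fill_level
$ systemctl start duffy duffy-tasks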

We investigated with @nphilipp what the issue could have been, as the duffy worker was stuck with a traceback:

Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     return self.mechanism.provision(nodes=nodes)
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/duffy/nodes/mechanisms/ansible.py", line 92, in provision
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     return self.run_playbook(
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:   File "/var/lib/duffy/venv/lib64/python3.8/site-packages/duffy/nodes/mechanisms/ansible.py", line 44, in run_playbook
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     with TemporaryDirectory() as tmpdir:
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:   File "/usr/lib64/python3.8/tempfile.py", line 780, in __init__
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     self.name = mkdtemp(suffix, prefix, dir)
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:   File "/usr/lib64/python3.8/tempfile.py", line 358, in mkdtemp
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]:     _os.mkdir(file, 0o700)
Jun 26 05:00:06 duffy.ci.centos.org duffy[2318765]: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmplga68zf0'
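
The failing call is nothing more than Python's tempfile module creating a directory under /tmp, so whether the filesystem is the culprit can be checked by hand with something as trivial as:

# roughly equivalent to the mkdtemp() call in the traceback above
$ mktemp -d
$ touch /tmp/duffy-io-test && rm /tmp/duffy-io-test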

It seems the VM itself suddenly had issues (for some hours) writing data to its own disks/filesystem (confirmed by other OS commands).
The only thing we found was a raid-check happening on the underlying hypervisor, which could have impacted IOs from the VM to the real disks (software raid1):

[Sun Jun 25 01:59:35 2023] md: data-check of RAID array md126
[Sun Jun 25 01:59:41 2023] md: delaying data-check of md127 until md126 has finished (they share one or more physical units)
[Sun Jun 25 02:19:05 2023] md: md126: data-check done.
[Sun Jun 25 02:19:05 2023] md: data-check of RAID array md127
[Sun Jun 25 02:19:11 2023] md: md127: data-check done.
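
For reference, the progress of such a data-check can be followed on the hypervisor with something like:

$ cat /proc/mdstat
$ cat /sys/block/md126/md/sync_action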

Even though we fixed the actual issue today, we are now in a state where the Duffy DB doesn't match the AWS state, with some "left behind" ec2 instances. So we have just announced a small downtime (see https://lists.centos.org/pipermail/ci-users/2023-June/005956.html) to start from a clean state again.

Closing, as we proceeded with the maintenance window and also updated duffy to version 3.3.7.
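
The deployed version can be double-checked with e.g. the following (assuming duffy is pip-installed in the venv seen in the traceback above):

$ /var/lib/duffy/venv/bin/pip show duffy | grep -i '^version'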

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
