#1211 Duffy nodes stuck in provisioning/contextualizing loop #2
Closed: Fixed 8 months ago by mrc0mmand. Opened 11 months ago by mrc0mmand.

Hey,

Looks like https://pagure.io/centos-infra/issue/1204 has resurfaced and Duffy fails to provision new machines:

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 5,
    "levels": {
      "provisioning": 4,
      "ready": 0,
      "contextualizing": 7,
      "deployed": 3,
      "deprovisioning": 0
    }
  }
}
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 3,
      "ready": 7,
      "contextualizing": 18,
      "deployed": 3,
      "deprovisioning": 0
    }
  }
}

I had a watch on duffy client show-pool for a while; the provisioning number goes up and down, so Duffy seems to be doing something, but the nodes never reach the "ready" state (or any other state, for that matter).


For the metal ones, it's because of the current availability of these machines on the AWS side.
When Duffy calls ansible-playbook to provision, the AWS API returns:

We currently do not have sufficient c5n.metal capacity in the Availability Zone . Our system will be working on provisioning additional capacity

@nphilipp: could something like this have an impact here?

Jun 30 14:53:22 duffy.ci.centos.org duffy[3335654]: /var/lib/duffy/venv/lib64/python3.8/site-packages/duffy/database/model/node.py:14: SAWarning: Attribute 'index_uniqueness_clause' on class <class 'duffy.database.model.node.Node'> appears to be a non-schema SQLAlchemy expression object; this won't be part of the declarative mapping. To map arbitrary expressions, use ``column_property()`` or a similar function such as ``deferred()``, ``query_expression()`` etc.
Jun 30 14:53:22 duffy.ci.centos.org duffy[3335654]:   class Node(Base, CreatableMixin, RetirableMixin):

I don't know about the workers, but it seems that we have machines in "provisioning" state while I don't see any ansible activity, so some machines are in a "limbo" state: Duffy seems to think they are being provisioned but nothing is actually processed. Or maybe the EC2 instances were provisioned but not recorded in the DB, so their state is unknown?

During the weekend I cleared out the ones in provisioning or contextualizing state, and Duffy then reprovisioned some instances. But today I'm looking at the stats and some are in provisioning state again while nothing is actually running (from the ansible PoV), so it's as if Duffy is lost somewhere. @nphilipp can you investigate?

In the meantime I'll have to create an ad-hoc task that queries the duffy DB for nodes in such a state (which I then set to 'failed'), retrieves the list of their EC2 instance IDs, and deletes those at the AWS level, so we don't end up with plenty of running EC2 instances that Duffy will never hand over to tenants :/
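A rough sketch of what such a cleanup task could look like (hypothetical names, not an actual CentOS CI script; the state/pool filtering follows the SQL statements used later in this thread, and the location of the EC2 instance ID follows Duffy's show-session output):

```python
# Hypothetical cleanup helper: find nodes stuck in 'provisioning' or
# 'contextualizing' in the EC2 pools, collect their ids (to mark failed
# in the DB) and their EC2 instance ids (to terminate on the AWS side).

STUCK_STATES = ("provisioning", "contextualizing")

def collect_stuck(nodes):
    """Return (node_ids_to_fail, ec2_instance_ids) for stuck EC2 nodes.

    `nodes` is a list of dicts mirroring rows from the `nodes` table,
    with provisioning details under data["provision"].
    """
    to_fail, instance_ids = [], []
    for node in nodes:
        if node["state"] in STUCK_STATES and "ec2" in (node["pool"] or ""):
            to_fail.append(node["id"])
            ec2_id = (node.get("data") or {}).get("provision", {}).get("ec2_instance_id")
            if ec2_id:
                instance_ids.append(ec2_id)
    return to_fail, instance_ids

# The actual task would then run something like
#   UPDATE nodes SET state='failed' WHERE id = ANY(...)
# and terminate the instances with boto3:
#   boto3.client("ec2").terminate_instances(InstanceIds=instance_ids)
```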

Metadata Update from @arrfab:
- Issue assigned to nphilipp
- Issue tagged with: centos-ci-infra, high-gain, high-trouble, investigation

10 months ago

Our jobs were timing out while waiting to reserve nodes from the metal-ec2-c5n-centos-8s-x86_64 pool.

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
  "fill_level": 5,
    "levels": {
      "provisioning": 4,
      "ready": 1,
      "contextualizing": 1,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

I haven't been able to get any machine from the metal-ec2-c5n-centos-9s-x86_64 pool today, there's one machine apparently stuck in the provisioning state:

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-9s-x86_64",
    "fill_level": 1,
    "levels": {
      "provisioning": 1,
      "ready": 0,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

I'm not sure if I've been just extremely unlucky or there's something else going on.

@mrc0mmand let me (again) resort to a direct SQL statement in the PostgreSQL DB, but clearly something is going wrong. @nphilipp will investigate with @dkirwan (myself being in away mode soon).

for the record:

duffy=> update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 12
duffy=> update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 9

I'll also bump the "fill_level" from 1 to 3 (and lower it from 5 to 3 for stream 8).
In the past the most requested one was 8s, but let's rebalance (we'll still be limited by the number of metal c5n instances in that region, though that's out of our control).

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-9s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 2,
      "ready": 1,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

I think the issue has popped up again:

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
  "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 1,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

Yeah, everything seems to be stuck again.

Hi, I repeated the commands @arrfab shared:

duffy=# update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 12
duffy=# update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 5
duffy=# 

@anoopcs and @mrc0mmand did this help at all here?

So far it looks good, thanks!

Aaaand as already mentioned on IRC the issue appears to be back:

$ duffy client show-pool metal-ec2-c5n-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

Not sure if someone talked to the DB again or the issue fixed itself, but the pools seem to be back to normal, so thanks :)

duffy=> update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 9
duffy=> update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 2

Kicked.

A permanent fix is currently blocked until I can get some time with @nphilipp to debug this together!

Also, there appears to be some other (or related) issue where, when retiring a session, the API confirms the session was retired, but for some reason the session is still in the ACTIVE state (thus counting towards the session quota). This happens very rarely (or so I assume; it's hard to tell, as the maximum session lifetime helps clean up the evidence).

The last occurrence I accidentally came across is:

2023-06-17 13:59:26,899 [_client/_send_single_request] INFO: HTTP Request: GET https://duffy.ci.centos.org/api/v1/sessions/198087 "HTTP/1.1 200 OK"
2023-06-17 13:59:26,900 [agent-control/show_session] INFO: Session: 198087
2023-06-17 13:59:26,900 [agent-control/show_session] INFO: Session owner: systemd
2023-06-17 13:59:26,900 [agent-control/show_session] INFO: Session active state: True
2023-06-17 13:59:26,900 [agent-control/show_session] INFO: Session retired at: None
2023-06-17 13:59:26,900 [agent-control/show_session] INFO: First session node: n27-26-79.pool.ci.centos.org
2023-06-17 13:59:26,900 [agent-control/free_session] INFO: [Try 1/10] Freeing session 198087 (with node n27-26-79.pool.ci.centos.org)
2023-06-17 13:59:27,258 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/198087 "HTTP/1.1 200 OK"
2023-06-17 13:59:28,258 [agent-control/free_session] INFO: [Try 2/10] Freeing session 198087 (with node n27-26-79.pool.ci.centos.org)
2023-06-17 13:59:28,641 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/198087 "HTTP/1.1 422 Unprocessable Entity"
2023-06-17 13:59:28,641 [agent-control/free_session] INFO: Session 198087 was successfully freed
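For context, the freeing logic visible in the log above retries the PUT until the API responds with 422, which it treats as confirmation that the session is no longer active. A minimal sketch of that pattern (the HTTP call is injected so the logic can be exercised without the real API; the names are mine, not agent-control's):

```python
# Sketch of the retry-until-422 pattern from the log above.

MAX_TRIES = 10

def free_session(session_id, put_retire):
    """Retry the retire request until it returns HTTP 422.

    `put_retire(session_id)` performs the PUT against
    /api/v1/sessions/<id> and returns the status code. A 200 means the
    API accepted the request (but, per the bug described above, the
    session may still be ACTIVE); a 422 means the session is already
    retired, i.e. we are done. Returns the number of attempts made.
    """
    for attempt in range(1, MAX_TRIES + 1):
        if put_retire(session_id) == 422:
            return attempt  # session confirmed retired
    raise RuntimeError(f"session {session_id} still active after {MAX_TRIES} tries")
```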

Looks like the virt-ec2-t2-centos-8s-x86_64 pool is busted this time:

{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 10,
      "ready": 0,
      "contextualizing": 2,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

Reran:

duffy=> update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 15
duffy=> update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 4

I'm meeting up today with @nphilipp to debug this issue, so fingers crossed we'll either have a fix or have a better idea where to look for one!

Looks like we're indeed back in business, thank you!

Good luck, always happy to help with testing/debugging when needed!

I can send on a meeting invite if you wanna join? :D I'll grab your email from FAS, and add you to the invite.

I'm afraid my today/tomorrow is already filled to the brim with meetings, so the best I can do is to watch from the side :D

no worries! Happy for you to just tag along!

Any idea what's going on with the metal-ec2-c5n-centos-8s-x86_64 pool? It's been stuck in provisioning for the past ~3 hours, and I'm not sure if it's the issue showing up again or AWS is simply out of machines (I can't really tell from duffy's current output):

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

Looks like the issue is only with the metal-ec2-c5n-centos-8s-x86_64 pool, other pools seem to work fine.

@dkirwan @nphilipp @arrfab Any chance of having the SQL command run before EOD? We're preparing a release, so any help would be greatly appreciated :)

Update from our troubleshooting last night.

We were unable to track down the cause of the provisioning error in the logs, but while investigating the AWS side we discovered that there were ~150 instances related to Duffy in AWS which had not been cleaned up. This could have been hitting instance limits on the AWS account.

We've cleaned up these instances manually, and @nphilipp has made a number of changes which might help going forward. We've also identified a number of other fixes that should be made to Duffy.

duffy=> update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 0
duffy=> update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 0

Seems to be all good judging by those 0's. Hopefully all good @mrc0mmand

Yup, so far everything seems to work well, thanks!

Looks like the cloud is back at doing cloud things, as there are a lot of machines stuck in contextualizing and no new machines are being provisioned:

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 11,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

Used the workaround again to unblock CI tenants, but I'll try to sync with @nphilipp and @dkirwan (some are attending the Flock event this week) about next steps to fix the underlying issue.

@nphilipp and myself had a look, and a new Duffy release (v3.3.9) was made available by Nils.
It was applied earlier today (announced on the ci-users list and the #centos-ci IRC channel) and we hope it will fix the behaviour some of you were seeing.
Closing for now, but feel free to reopen if a similar issue comes back.

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

9 months ago

Looks like the issue is back (or at least the symptoms are the same - machines are stuck in provisioning & contextualizing):

$ duffy client show-pool metal-ec2-c5n-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 9,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}
$ duffy client show-pool virt-ec2-t2-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 0,
      "ready": 10,
      "contextualizing": 13,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

Metadata Update from @mrc0mmand:
- Issue status updated to: Open (was: Closed)

9 months ago

Looks like now the machines are forever stuck in the "deployed" state :(

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 0,
      "deployed": 9,
      "deprovisioning": 0
    }
  }
}

@dkirwan @nphilipp @arrfab any chance of getting them unstuck somehow so they don't block resources?

We are also stuck without any nodes in ready state. Can we have the usual workaround in place again?

Hi there @anoopcs we've just run our workaround there, and manually cleaned up a number of instances on AWS.

Can you check back shortly.

duffy=> update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 5
duffy=> update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 0

Yes, I can now see nodes in ready state.

@dkirwan @nphilipp Looks like the issue is back again :(

duffy=> update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 9
duffy=> update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 10

Should be ready to go Monday morning for you folks.

Thank you :heart:

Wanted to chime in with an update from the dev side, too: I’m overhauling how sessions get created (plus a side quest 😎) – the production code isn’t robust enough against concurrent attempts at allocating ready nodes into new sessions.

In theory it should be simple enough: Attempt to reserve nodes by marking them contextualizing and committing that to the database. If that fails (because a concurrent request tried to reserve some of the same nodes), lather, rinse, repeat a couple of times until this succeeds. At this point, the nodes, the session, etc. should be exclusive to the current request.
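The reservation loop described above could look roughly like this (a sketch under stated assumptions, not Duffy's actual implementation; the commit step is injected as a callable standing in for the database transaction):

```python
# Sketch of optimistic node reservation with retry: mark candidate nodes
# 'contextualizing' and try to commit; if a concurrent request won the
# race, back off briefly and retry with whatever is still ready.

import random
import time

def reserve_nodes(wanted, list_ready, try_commit, max_tries=5):
    """Try to move `wanted` ready nodes into 'contextualizing'.

    `list_ready()` returns currently-ready node ids; `try_commit(ids)`
    returns True on a successful commit, False on a serialization or
    integrity conflict. Returns the reserved ids, which are then
    exclusive to the current request.
    """
    for attempt in range(max_tries):
        candidates = list_ready()[:wanted]
        if len(candidates) < wanted:
            raise RuntimeError("not enough ready nodes")
        if try_commit(candidates):
            return candidates
        # Lost the race: wait briefly (with jitter) before retrying.
        time.sleep(0.01 * (attempt + 1) + random.random() * 0.01)
    raise RuntimeError("could not reserve nodes after retries")
```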

However, we’ve experienced database conflicts later in the workflow in the past and naïvely implemented “try a couple of times” there (but wrongly!) to mitigate them, without success, as you know. I’m working on making this (more) watertight, but the current logs don’t let me distinguish which request a specific log line relates to (the mentioned side quest – yeah, a rookie mistake). I’m almost done with that part, and I have some ideas for fixing the actual issue once the logs let me understand what’s happening between the concurrent requests.

As already mentioned on IRC by @anoopcs, the issue is back again:

$ duffy client show-pool metal-ec2-c5n-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 3,
      "deployed": 9,
      "deprovisioning": 0
    }
  }
}
$ duffy client show-pool virt-ec2-t2-centos-8s-x86_64
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 2,
      "ready": 8,
      "contextualizing": 3,
      "deployed": 15,
      "deprovisioning": 0
    }
  }
}
duffy=> update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 6
duffy=> update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 10

Manually cleaned up instances in AWS. Should be good to go again shortly @mrc0mmand !

I think I've fixed the issue in this pull request: https://github.com/CentOS/duffy/pull/905

This is not yet deployed, needs to be rolled into a release first.

I’ve just released an RC on PyPI (v3.4.0rc1) which contains the fixes for this plus much more. I haven’t deployed this yet, because it will cause a short downtime – there’s a small DB schema change that needs to be applied.

@dkirwan and I updated Duffy to v3.4.0rc1, applied the schema change and restarted it at about 10:15~10:20 AM UTC today. We also cleaned up stuck/stale nodes in Duffy and EC2.

After the Duffy update I started seeing "ghost" sessions, i.e. sessions that have a node in the "done" state but are still active:

sh-4.4$ duffy client --format=flat list-sessions | sort
session_id=627900 active=TRUE created_at='2023-09-08 15:16:53.994044+00:00' retired_at= pool= hostname='n27-42-56.pool.ci.centos.org' ipaddr='172.27.42.56'
session_id=627942 active=TRUE created_at='2023-09-08 18:55:43.057920+00:00' retired_at= pool= hostname='n27-27-50.pool.ci.centos.org' ipaddr='172.27.27.50'
session_id=627944 active=TRUE created_at='2023-09-08 18:55:43.621269+00:00' retired_at= pool= hostname='n27-35-100.pool.ci.centos.org' ipaddr='172.27.35.100'
session_id=627946 active=TRUE created_at='2023-09-08 19:00:53.441673+00:00' retired_at= pool='metal-ec2-c5n-centos-8s-x86_64' hostname='n27-47-118.pool.ci.centos.org' ipaddr='172.27.47.118'
session_id=627949 active=TRUE created_at='2023-09-08 19:26:11.587519+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-13-36.pool.ci.centos.org' ipaddr='172.27.13.36'

Notice the empty pool= in the first three sessions.

$ duffy client show-session 627900
{
  "action": "get",
  "session": {
    "retired_at": null,
    "active": true,
    "created_at": "2023-09-08T15:16:53.994044+00:00",
    "id": 627900,
    "expires_at": "2023-09-08T21:16:53.993095+00:00",
    "tenant": {
...
    },
    "data": {
      "nodes_specs": [
        {
          "quantity": 1,
          "pool": "metal-ec2-c5n-centos-8s-x86_64"
        }
      ]
    },
    "nodes": [
      {
        "hostname": "n27-42-56.pool.ci.centos.org",
        "ipaddr": "172.27.42.56",
        "comment": null,
        "pool": null,
        "reusable": false,
        "data": {
          "provision": {
            "ec2_instance_id": "i-022f2d79e192b823e",
            "ec2_instance_type": "c5n.metal",
            "hostname": "n27-42-56.pool.ci.centos.org",
            "ipaddr": "172.27.42.56",
            "public_hostname": "ec2-54-83-190-51.compute-1.amazonaws.com",
            "public_ipaddress": "54.83.190.51"
          },
          "nodes_spec": {
            "quantity": 1,
            "pool": "metal-ec2-c5n-centos-8s-x86_64"
          }
        },
        "id": 158030,
        "state": "done"
      }
    ]
  }
}
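These "ghost" sessions can be spotted mechanically from the API output: a session is suspect when it is still active but every node attached to it has already reached the "done" state. A small sketch, assuming session dicts shaped like the JSON above:

```python
def find_ghost_sessions(sessions):
    """Return ids of sessions that are still active even though all of
    their nodes are already in the "done" state (the "ghost" pattern)."""
    return [
        s["id"]
        for s in sessions
        if s["active"]
        and s.get("nodes")
        and all(n["state"] == "done" for n in s["nodes"])
    ]
```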

@dkirwan @nphilipp Looks like the issue is back again :(

and again :disappointed:

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 0,
      "deployed": 9,
      "deprovisioning": 0
    }
  }
}

No ready nodes :expressionless:

Had a quick look, and yes: 3 were provisioned and then became available (one was even used directly):

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 1,
      "ready": 2,
      "contextualizing": 0,
      "deployed": 10,
      "deprovisioning": 0
    }

One week later: are we seeing far fewer issues like the one originally described in this ticket?

duffy=> update nodes set state='failed' where state='provisioning' and pool like '%ec2%';
UPDATE 1
duffy=> update nodes set state='failed' where state='contextualizing' and pool like '%ec2%';
UPDATE 0

There seem to be a few new issues arising. Shall we close this as fixed, and create a new ticket to track any new issues showing up in Duffy?

Yup, that definitely makes sense, as this one is already too big and the original issue seems to be gone (thanks to everyone involved!).

Let's close this one and open new tickets if/when further issues arise (I haven't seen the "ghost" sessions I mentioned above for some time, so hopefully there's nothing new to report for now).

Metadata Update from @mrc0mmand:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

8 months ago
