Hey,
Recently I noticed that we hit the metal node limit relatively often, and after watching the output of `duffy client show-pool ...` for some time, I wonder if there's another issue related to session retirement. It might be related to my comment from the previous ticket - https://pagure.io/centos-infra/issue/1211#comment-873349.
Long story short, I noticed that many nodes remain "deployed" for quite a long time (several hours), which seems suspicious when compared to my previous observations. Would it be possible to query the DB/Duffy logs for machines that were garbage collected (i.e. allocated for >= 6 hours), just to see if the sessions/nodes are properly cleaned up? Unfortunately, all I can do from outside is to patiently watch the output of `duffy client show-pool`.
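For reference, the "patient watching" is roughly the following; a minimal sketch that only assumes the standard `watch` utility on top of the duffy client (the pool names are the two from the snippets further down):

```sh
# Re-run show-pool every 5 minutes for the two pools of interest and highlight
# differences between runs, to spot "deployed" counts that never go down.
watch -d -n 300 'duffy client show-pool virt-ec2-t2-centos-8s-x86_64; \
                 duffy client show-pool metal-ec2-c5n-centos-8s-x86_64'
```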
What I think is happening is that occasionally the node retirement fails (for whatever reason) and we slowly accumulate deployed nodes over time. And sometimes this is frequent enough that the 6-hour garbage-collect window is not enough to clean up the mess and we trip over the pool limits (this mostly applies to the metal pools). I noticed the issue in both the metal and the ec2 pools, though.
So, after watching the `show-pool` output for a couple of hours (on a second monitor) there has been no change in the number of deployed nodes in the metal pool, and, similarly, the number of deployed nodes in the C8S pool remains suspiciously high, so I guess even the garbage collector doesn't work as expected (at least for these nodes).
For record keeping, the state of the "monitored" pools at the time of writing (~00:45 CEST):
{ "action": "get", "pool": { "name": "virt-ec2-t2-centos-8s-x86_64", "fill_level": 10, "levels": { "provisioning": 1, "ready": 9, "contextualizing": 0, "deployed": 17, "deprovisioning": 0 } } } { "action": "get", "pool": { "name": "metal-ec2-c5n-centos-8s-x86_64", "fill_level": 3, "levels": { "provisioning": 3, "ready": 0, "contextualizing": 0, "deployed": 9, "deprovisioning": 0 } } }
Looks like the situation hasn't changed overnight, so there's definitely something going south:
/cc @nphilipp @dkirwan
Metadata Update from @arrfab:
- Issue assigned to nphilipp
- Issue tagged with: centos-ci-infra, high-gain, high-trouble, investigation
I've unstuck the stuck nodes (set them to "failed" in the database after pinging some of them to verify they're actually gone), but because of the current log configuration, I can't debug what happened back when they should have been deprovisioned.
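For the record, the reachability check before touching the database was along these lines; a sketch only, with hypothetical hostnames passed as arguments and the actual DB statement omitted since it depends on the deployed schema:

```sh
#!/bin/bash
# Pass the hostnames of the seemingly stuck nodes as arguments.
for host in "$@"; do
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
        echo "$host still responds -- leaving it alone"
    else
        echo "$host is unreachable -- marking it as failed"
        # The state change itself was done directly in the Duffy database;
        # the exact statement is omitted here because it depends on the schema.
    fi
done
```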
I’ve set up persistent logging on the Duffy host, so next time this happens, we should hopefully have some information to debug the issue. There’s little else I can do at the moment, though.
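(For reference, enabling persistent journal storage boils down to something like the following, assuming the Duffy host uses systemd-journald for logging:)

```sh
# Keep the journal across reboots instead of only in volatile /run storage.
sudo mkdir -p /var/log/journal
# Fix up ownership/ACLs of the new journal directory.
sudo systemd-tmpfiles --create --prefix /var/log/journal
# Alternatively, set Storage=persistent in /etc/systemd/journald.conf.
sudo systemctl restart systemd-journald
```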
Just reviewing open tickets and wondering what the status is on this one?
I haven't seen anything suspicious on my side for the past month+.
@mrc0mmand thanks for the feedback! @nphilipp: how about cutting a final release (IIRC it was an RC) so that we can also trigger the `cico-workspace` container rebuild and get the duffy client updated to the same level?
Today I noticed two stray sessions without any "pool" that were left over even after retiring them:
```
sh-4.4$ duffy client --format=flat list-sessions | sort
session_id=699316 active=TRUE created_at='2023-11-02 12:10:34.383716+00:00' retired_at= pool= hostname='n27-22-134.pool.ci.centos.org' ipaddr='172.27.22.134'
session_id=699363 active=TRUE created_at='2023-11-02 12:18:44.522731+00:00' retired_at= pool= hostname='n27-24-186.pool.ci.centos.org' ipaddr='172.27.24.186'
```
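(They stand out with a quick filter on the flat output; a sketch that assumes the `pool=` field stays empty for such sessions, as above:)

```sh
# List active sessions that have no pool recorded (empty "pool=" field).
duffy client --format=flat list-sessions \
    | grep 'active=TRUE' \
    | grep -E '(^| )pool= ' \
    | sort
```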
I cleaned things out lately and now only see active, unexpired sessions. Do you think we can close this, or shall we keep it open for monitoring?
@mrc0mmand, @nphilipp: just having a look at tickets that have been open for a long time and wondering if we can say that the problem disappeared, or whether something else needs to be investigated at this point? Feel free to close the ticket if it's not an issue anymore :)
Yeah, I guess we can close this one; I haven't seen any such issues in a long time.
Metadata Update from @mrc0mmand:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)