#1132 Mysteriously disappearing Duffy sessions in CentOS CI
Closed: Fixed 10 months ago by zlopez. Opened a year ago by mrc0mmand.

Hey!

So, this is an issue I've been trying to hunt down for a while. In short, sometimes the machine we're running stuff on in CentOS CI suddenly disappears for no apparent reason. Yesterday I sprinkled our systemd "agent" script with some more debug info, and it looks like the session just disappears. Usually the session should be just retired and still exist, but in this case all traces to it are gone, and Duffy returns "not found":

2023-04-14 16:41:54,870 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool virt-ec2-t2-centos-8s-x86_64
2023-04-14 16:41:56,873 [agent-control/allocate_node] INFO: Allocated node n27-41-233.pool.ci.centos.org with session id 177147
2023-04-14 16:41:56,874 [agent-control/allocate_node] INFO: Checking if node n27-41-233.pool.ci.centos.org is reachable over ssh
...
2023-04-14 16:41:57,129 [agent-control/execute_remote_command] INFO: Executing a REMOTE command on node 'n27-41-233.pool.ci.centos.org':
...
<the bootstrap/test script suddenly starts failing due to missing files/directories>
...
2023-04-14 16:54:05,433 [agent-control/show_session] ERROR: API returned an error: detail='Not Found'
2023-04-14 16:54:05,433 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 0/10]
2023-04-14 16:54:05,752 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:06,253 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 1/10]
2023-04-14 16:54:06,590 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:07,090 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 2/10]
2023-04-14 16:54:07,423 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:07,924 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 3/10]
2023-04-14 16:54:08,251 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:08,751 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 4/10]
2023-04-14 16:54:09,080 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:09,580 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 5/10]
2023-04-14 16:54:09,904 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:10,404 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 6/10]
2023-04-14 16:54:10,722 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:11,223 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 7/10]
2023-04-14 16:54:11,540 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:12,041 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 8/10]
2023-04-14 16:54:12,363 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-04-14 16:54:12,864 [agent-control/free_session] INFO: Freeing session 177147 (with node n27-41-233.pool.ci.centos.org) [try 9/10]
2023-04-14 16:54:13,185 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'

As mentioned above - even if the session is released/retired, it should still exist. For example, this is from one of the older jobs:

2023-04-14 16:59:00,180 [agent-control/free_session] INFO: Freeing session 177135 (with node n27-37-99.pool.ci.centos.org) [try 0/10]
2023-04-14 16:59:01,005 [agent-control/free_session] INFO: Freeing session 177135 (with node n27-37-99.pool.ci.centos.org) [try 1/10]
2023-04-14 16:59:01,342 [agent-control/free_session] INFO: Session 177135 was successfully freed
$ duffy client show-session 177135
{
  "action": "get",
  "session": {
    "retired_at": "2023-04-14T16:59:00.493339+00:00",
    "active": false,
    "created_at": "2023-04-14T16:03:06.784239+00:00",
    "id": 177135,
    "expires_at": "2023-04-14T22:03:07.314981+00:00",
    "tenant": {
      "retired_at": null,
      "active": true,
      "created_at": "2022-06-23T13:25:27.880793+00:00",
      "name": "systemd",
...

But for the session above, I just get an error:

$  duffy client show-session 177147
{
  "error": {
    "detail": "Not Found"
  }
}

A couple of similarly "vaporized" sessions from our CI logs:

  • 172808
  • 176853
  • 176339
  • 175409
  • 174837
  • 174065
  • 174056
  • 172457
  • 171646
  • 171253
  • 170847
  • 168266
  • 164508
  • 164050

Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, investigation, medium-gain, medium-trouble

a year ago

@nphilipp : any idea ? had no time to investigate in duffy db/log itself so it seems you're the best placed to have a look

Hmm, I see a couple of DB serialization errors related to creating sessions. I started retrying more of the involved DB activity on serialization failure, but I’m at a loss why users would even get an id in the first place then – if a transaction fails, the API shouldn’t yield anything except the HTTP error.

Metadata Update from @arrfab:
- Issue assigned to nphilipp

a year ago

Just a couple of minutes ago I noticed two jobs failed with the same symptoms as above:

compiling _libvirt.c
compiling common.c
compiling connect.c
compiling domain.c
Assembler messages:
Fatal error: can't create domain.o: No such file or directory
make: *** [Makefile:245: domain.o] Error 1

make failed, exit code 2

<...snip...>

2023-05-24 05:08:47,988 [agent-control/fetch_artifacts] INFO: Fetching artifacts from node n27-4-233.pool.ci.centos.org: (remote: ~/vagrant-logs*, local: ./artifacts_4x43mp0n)
2023-05-24 05:08:47,988 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/scp -r -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@n27-4-233.pool.ci.centos.org:~/vagrant-logs* ./artifacts_4x43mp0n
ssh: Could not resolve hostname n27-4-233.pool.ci.centos.org: Name or service not known
2023-05-24 05:08:49,131 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed
2023-05-24 05:08:49,131 [agent-control/main] ERROR: Execution failed
Traceback (most recent call last):
  File "<https://jenkins-systemd.apps.ocp.cloud.ci.centos.org/job/upstream-vagrant-archlinux-sanitizers/ws/systemd-centos-ci/./agent-control.py",> line 494, in main
    ac.execute_remote_command(command, artifacts_dir="~/vagrant-logs*")
  File "<https://jenkins-systemd.apps.ocp.cloud.ci.centos.org/job/upstream-vagrant-archlinux-sanitizers/ws/systemd-centos-ci/./agent-control.py",> line 178, in execute_remote_command
    raise RuntimeError("Remote command exited with an unexpected return code "
RuntimeError: Remote command exited with an unexpected return code (got: 255, expected: 0), bailing out
2023-05-24 05:08:49,474 [_client/_send_single_request] INFO: HTTP Request: GET https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:49,474 [agent-control/show_session] ERROR: API returned an error: detail='Not Found'
2023-05-24 05:08:49,474 [agent-control/free_session] INFO: [Try 1/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:49,811 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:49,812 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:08:50,812 [agent-control/free_session] INFO: [Try 2/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:51,137 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:51,137 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:08:52,137 [agent-control/free_session] INFO: [Try 3/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:52,461 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:52,462 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:08:53,462 [agent-control/free_session] INFO: [Try 4/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:53,791 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:53,791 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:08:54,791 [agent-control/free_session] INFO: [Try 5/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:55,187 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:55,187 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:08:56,187 [agent-control/free_session] INFO: [Try 6/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:56,601 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:56,601 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:08:57,601 [agent-control/free_session] INFO: [Try 7/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:57,925 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:57,925 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:08:58,925 [agent-control/free_session] INFO: [Try 8/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:59,342 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:08:59,342 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:09:00,342 [agent-control/free_session] INFO: [Try 9/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:09:00,765 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:09:00,766 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'
2023-05-24 05:09:01,766 [agent-control/free_session] INFO: [Try 10/10] Freeing session 186340 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:09:02,130 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186340 "HTTP/1.1 404 Not Found"
2023-05-24 05:09:02,131 [agent-control/free_session] INFO: Received an API error from the server: detail='Not Found'

However, after a closer look, I noticed that both jobs use a machine with the same FQDN, but both of them have a different session ID. The second job:

2023-05-24 05:08:15,356 [_client/_send_single_request] INFO: HTTP Request: GET https://duffy.ci.centos.org/api/v1/sessions/186342 "HTTP/1.1 200 OK"
2023-05-24 05:08:15,356 [agent-control/show_session] INFO: Session: 186342
2023-05-24 05:08:15,356 [agent-control/show_session] INFO: Session owner: systemd
2023-05-24 05:08:15,356 [agent-control/show_session] INFO: Session active state: True
2023-05-24 05:08:15,356 [agent-control/show_session] INFO: Session retired at: None
2023-05-24 05:08:15,357 [agent-control/show_session] INFO: First session node: n27-4-233.pool.ci.centos.org
2023-05-24 05:08:15,357 [agent-control/free_session] INFO: [Try 1/10] Freeing session 186342 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:15,705 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186342 "HTTP/1.1 200 OK"
2023-05-24 05:08:16,706 [agent-control/free_session] INFO: [Try 2/10] Freeing session 186342 (with node n27-4-233.pool.ci.centos.org)
2023-05-24 05:08:17,044 [_client/_send_single_request] INFO: HTTP Request: PUT https://duffy.ci.centos.org/api/v1/sessions/186342 "HTTP/1.1 422 Unprocessable Entity"
2023-05-24 05:08:17,044 [agent-control/free_session] INFO: Session 186342 was successfully freed

Notice that both of the jobs use a machine n27-4-233.pool.ci.centos.org, but have different session IDs (186342 vs 186340).

@nphilipp : had you a chance to look at the issue ?

Sorry for dropping the ball 🙈.

Here’s my working hypothesis: Concurrent requests attempt to allocate the same nodes (modify the respective objects in the DB) – only one of those requests can succeed, the other(s) have to fail. Now, the transaction is committed automatically after the request handler has succeeded, but by this time the response has been sent already.

I plan to let it explicitly commit the transaction in the request handler, if this doesn’t fix things completely, we should at least get better information about why this happens in the first place.

@arrfab I’ve submitted a PR over on GitHub: https://github.com/CentOS/duffy/pull/852

I’m hesitant to just roll a release with it, given that we don’t exactly know what causes this and therefore, if this is the right approach. Do you think we can test this as a hotfix?

@nphilipp as we don't have .stg. env anymore for Duffy, yes, probably the only way to test is (unfortunately) to do it "live"... we can "hotfix" it (so just replacing .py file or patch it) but ideally we'd sync to just to that together and keep an eye after it's done (and so test duffy at functional level). Does that work for you ?

I could roll a pre-release, this way it probably won’t be picked up by the playbook but you can use something like pip install -U --pre … in the virtualenv rather than copy files in places. If that proves to be at least not worse, then I can roll the next stable version. @arrfab What do you think?

@mrc0mmand, we deployed a version with the changes I talked about above. Would you please check if these changes improved things for you and report back? As I understand it, this only happened occasionally, so take as much time for it as you need. Thanks!

@mrc0mmand, we deployed a version with the changes I talked about above. Would you please check if these changes improved things for you and report back? As I understand it, this only happened occasionally, so take as much time for it as you need. Thanks!

Thank you! I'll keep monitoring the jobs and will report back if I see anything suspicious.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

10 months ago

Login to comment on this ticket.

Metadata