#977 "Disappearing" CentOS CI machines
Closed: Fixed with Explanation 2 years ago by arrfab. Opened 2 years ago by mrc0mmand.

Since yesterday I've noticed that the machines provided by CentOS CI sometimes "disappear" - i.e. the machine is no longer reachable and ssh-ing to it fails with "Name or service not known".

The first occurrence (given the logs I have at hand) was on 2022-11-22 ~23:35 CET:

ssh: Could not resolve hostname n27-18-187.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-24-219.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-28-155.pool.ci.centos.org: Name or service not known

Another one happened today (2022-11-23) around 15:50 CET:

ssh: Could not resolve hostname n27-15-186.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-33-104.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-32-227.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-12-43.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-12-99.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-34-239.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-44-94.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-29-55.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-24-180.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-40-14.pool.ci.centos.org: Name or service not known

It seems to happen in all the pools we use (i.e. both the metal and the T2 EC2 ones, C8S and C9S).


Another interesting point: after the last "event" the respective Duffy sessions were left behind, which suggests the pods couldn't reach the API endpoint to retire them. Also, there were 11 sessions attached to our tenant instead of 10 (which is the current quota):

$ duffy client --format=flat list-sessions
session_id=149606 active=TRUE created_at='2022-11-23 14:54:55.576215+00:00' retired_at= pool='virt-ec2-t2-centos-9s-x86_64' hostname='n27-22-200.pool.ci.centos.org' ipaddr='172.27.22.200'
session_id=149561 active=TRUE created_at='2022-11-23 14:52:20.960602+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-9-193.pool.ci.centos.org' ipaddr='172.27.9.193'
session_id=149576 active=TRUE created_at='2022-11-23 14:54:18.606937+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-3-7.pool.ci.centos.org' ipaddr='172.27.3.7'
session_id=149580 active=TRUE created_at='2022-11-23 14:54:21.863244+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-31-219.pool.ci.centos.org' ipaddr='172.27.31.219'
session_id=149582 active=TRUE created_at='2022-11-23 14:54:23.765660+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-3-0.pool.ci.centos.org' ipaddr='172.27.3.0'
session_id=149588 active=TRUE created_at='2022-11-23 14:54:26.006745+00:00' retired_at= pool='virt-ec2-t2-centos-9s-x86_64' hostname='n27-14-180.pool.ci.centos.org' ipaddr='172.27.14.180'
session_id=149590 active=TRUE created_at='2022-11-23 14:54:26.726718+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-40-128.pool.ci.centos.org' ipaddr='172.27.40.128'
session_id=149594 active=TRUE created_at='2022-11-23 14:54:30.849463+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-46-222.pool.ci.centos.org' ipaddr='172.27.46.222'
session_id=149596 active=TRUE created_at='2022-11-23 14:54:31.943544+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-11-240.pool.ci.centos.org' ipaddr='172.27.11.240'
session_id=149603 active=TRUE created_at='2022-11-23 14:54:53.117909+00:00' retired_at= pool='virt-ec2-t2-centos-9s-x86_64' hostname='n27-47-210.pool.ci.centos.org' ipaddr='172.27.47.210'
session_id=149608 active=TRUE created_at='2022-11-23 14:54:59.543202+00:00' retired_at= pool='virt-ec2-t2-centos-9s-x86_64' hostname='n27-11-8.pool.ci.centos.org' ipaddr='172.27.11.8'
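
For reference, a quick way to spot such leftovers is to count the active=TRUE lines in the flat listing and compare them against the tenant quota. A minimal sketch wrapped around the CLI invocation shown above (the quota of 10 is our tenant's, per this ticket; everything else is illustrative):

#!/usr/bin/env python3
# Minimal sketch: count the active Duffy sessions reported by the CLI and
# compare against the tenant quota (10 for our tenant, per this ticket).
# The command and the flat output format are the ones shown above.
import subprocess

QUOTA = 10  # current quota for our tenant (see above)

def count_active_sessions() -> int:
    out = subprocess.run(
        ["duffy", "client", "--format=flat", "list-sessions"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each line looks like: session_id=... active=TRUE ... hostname=... ipaddr=...
    return sum(1 for line in out.splitlines() if "active=TRUE" in line)

if __name__ == "__main__":
    active = count_active_sessions()
    print(f"{active} active sessions (quota: {QUOTA})")
    if active > QUOTA:
        print("More sessions than the quota -- some were probably left behind")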

More occurrences from today:

05:52 CET:

[TASK] TEST-38-FREEZER (/bin/time -v -- make -C test/TEST-38-FREEZER setup run && touch $TESTDIR/pass; rm -fv $TESTDIR/*.img; test -e $TESTDIR/pass) [try 1/2]
[RESULT] TEST-38-FREEZER - PASS (log file: /root/testsuite-logs-upstream.Hvf/TEST-38-FREEZER_1_PASS.log)

[TASK] TEST-39-EXECRELOAD (/bin/time -v -- make -C test/TEST-39-EXECRELOAD setup run && touch $TESTDIR/pass; rm -fv $TESTDIR/*.img; test -e $TESTDIR/pass) [try 1/2]
2022-11-24 04:47:47,650 [agent-control/fetch_artifacts] INFO: Fetching artifacts from node n27-9-144.pool.ci.centos.org: (remote: ~/testsuite-logs*, local: ./artifacts_3yrmtuu_)
2022-11-24 04:47:47,650 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/scp -r -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@n27-9-144.pool.ci.centos.org:~/testsuite-logs* ./artifacts_3yrmtuu_
ssh: Could not resolve hostname n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:47:53,673 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed
2022-11-24 04:47:53,673 [agent-control/main] INFO: Trying to collect kernel dumps from n27-9-144.pool.ci.centos.org:/var/crash
2022-11-24 04:48:03,676 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #0)
2022-11-24 04:48:03,677 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:48:24,707 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #1)
2022-11-24 04:48:24,707 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:48:45,737 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #2)
2022-11-24 04:48:45,738 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:49:06,771 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #3)
2022-11-24 04:49:06,771 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:49:27,788 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #4)
2022-11-24 04:49:27,789 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:49:48,820 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #5)
2022-11-24 04:49:48,820 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:50:09,851 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #6)
2022-11-24 04:50:09,851 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:50:30,872 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #7)
2022-11-24 04:50:30,872 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:50:51,903 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #8)
2022-11-24 04:50:51,903 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:51:12,934 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #9)
2022-11-24 04:51:12,934 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:51:33,964 [agent-control/main] ERROR: Execution failed
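
(For context, the wait_for_node step visible in the log above is essentially a ping-with-retries loop; a minimal sketch of that logic, with the retry count, the 10 s ping timeout and the pause between tries read off the log, though the real agent-control implementation may differ:)

# Minimal sketch of the "wait_for_node" logic seen in the log above:
# ping the node once per try and give up after a fixed number of tries.
# Retry count and pause are inferred from the log timestamps; the real
# agent-control code may differ.
import subprocess
import time

def wait_for_node(hostname: str, tries: int = 10, pause: int = 11) -> bool:
    for attempt in range(tries):
        print(f"Checking if the node {hostname} is alive (try #{attempt})")
        alive = subprocess.run(
            ["/usr/bin/ping", "-q", "-c", "1", "-W", "10", hostname],
            capture_output=True,
        ).returncode == 0
        if alive:
            return True
        time.sleep(pause)
    return False

if __name__ == "__main__":
    if not wait_for_node("n27-9-144.pool.ci.centos.org"):
        print("Execution failed")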

09:39 - 09:43:

ssh: Could not resolve hostname n27-0-85.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-45-53.pool.ci.centos.org: Name or service not known

I can't even reach Duffy from the slave01 node:

[Prod - slave01.ci.centos.org]
(duffy) [systemd@slave01 ~]$ duffy client --format=flat list-sessions
Traceback (most recent call last):
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/ssl.py", line 1226, in recv
    return self.read(buflen)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/ssl.py", line 1101, in read
    return self._sslobj.read(len)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
    raise exc
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
    response = connection.handle_request(request)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/connection.py", line 90, in handle_request
    return self._connection.handle_request(request)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/http11.py", line 105, in handle_request
    raise exc
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/http11.py", line 84, in handle_request
    ) = self._receive_response_headers(**kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/http11.py", line 148, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/http11.py", line 177, in _receive_event
    data = self._network_stream.read(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc)
httpcore.ReadTimeout: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/systemd/venv/duffy/bin/duffy", line 8, in <module>
    sys.exit(cli())
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/duffy/cli.py", line 670, in client_list_sessions
    result = obj["client"].list_sessions()
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/duffy/client/main.py", line 109, in list_sessions
    return self._query_method(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/duffy/client/main.py", line 94, in _query_method
    response = client_method(url=url, **add_kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 1039, in get
    return self.request(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 815, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 902, in send
    response = self._send_handling_auth(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 930, in _send_handling_auth
    response = self._send_handling_redirects(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 967, in _send_handling_redirects
    response = self._send_single_request(request)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 1003, in _send_single_request
    response = transport.handle_request(request)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out

nor from the pods:

00:00:23.659 2022-11-24 08:49:44,791 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-ec2-c5n-centos-8s-x86_64
00:02:19.014 2022-11-24 08:51:40,145 [agent-control/allocate_node] ERROR: [Try 12/1440] Received an API error from the server: Timeout while requesting URL('https://duffy.ci.centos.org/api/v1/sessions')

And trying to ping an AWS node from slave01 yields the known error:

[Prod - slave01.ci.centos.org]
(duffy) [systemd@slave01 ~]$ ping n27-45-53.pool.ci.centos.org
ping: n27-45-53.pool.ci.centos.org: Name or service not known
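
To make it easier to tell the two failure modes apart next time (the Duffy API timing out vs. the pool.ci.centos.org names not resolving), a quick probe along these lines could be run from slave01 or a pod. This is only a sketch: the endpoint URL is the one from the pod error above, and an unauthenticated GET may well return an HTTP error rather than a session list, but a fast HTTP response vs. a read timeout is the distinction that matters here:

# Minimal probe sketch: check whether the Duffy API answers at all within a
# short timeout, independently of the duffy CLI. An unauthenticated GET may
# return an HTTP error, which is fine -- we only care whether the endpoint
# answers quickly or times out.
import httpx

DUFFY_API = "https://duffy.ci.centos.org/api/v1/sessions"

try:
    resp = httpx.get(DUFFY_API, timeout=10)
    print(f"API reachable: HTTP {resp.status_code}")
except httpx.TimeoutException as exc:
    print(f"API timeout: {exc!r}")
except httpx.TransportError as exc:
    print(f"API transport error: {exc!r}")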

Metadata Update from @arrfab:
- Issue assigned to arrfab
- Issue tagged with: centos-ci-infra, high-gain, high-trouble

2 years ago

(Also discussed on IRC in #centos-ci on irc.libera.chat, but giving a summary here):

From the Zabbix/monitoring logs I can confirm that we had issues related to the VPN itself going down and then back up (between the RDU2C DC and the remote AWS VPCs).

The "Name or service not known" issue (when dealing with nodes in the pool.ci.centos.org sub-domain) is due to exactly that: it's a remote private zone hosted in AWS/Route53, reached through the VPN. So when the tunnel goes down, the local resolvers, which forward to Route53, aren't able to resolve it.
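
(A quick way to confirm that from slave01 or a pod is to resolve a pool.ci.centos.org name next to a name that is served internally; a minimal sketch, assuming duffy.ci.centos.org indeed resolves locally as noted below, with one of the affected nodes from this ticket used as the example:)

# Resolve a pool.ci.centos.org name (forwarded to the private Route53 zone
# over the VPN) next to duffy.ci.centos.org (resolved internally). When the
# tunnel is down, the first lookup should fail with gaierror while the
# second keeps working. The pool hostname is one of the nodes from this ticket.
import socket

for name in ("n27-45-53.pool.ci.centos.org", "duffy.ci.centos.org"):
    try:
        addrs = {ai[4][0] for ai in socket.getaddrinfo(name, None)}
        print(f"{name}: resolves to {', '.join(sorted(addrs))}")
    except socket.gaierror as exc:
        print(f"{name}: resolution failed ({exc})")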

The other issue I see is, though, interesting and should not be linked to the external network:
duffy client --format=flat list-sessions is internal (and duffy.ci.centos.org resolves internally too), so the Duffy API should be able to answer internal queries without timing out (but that's something @nphilipp can confirm). I would understand it if this happened during a duffy client request-session, as that has to reach the remote EC2 node for contextualization and so would fail,

or (still to be investigated) it's something related to internal network issues, which would then explain both the VPN and the internal endpoint problems.

Just an update: after some verification, we decided ~24h ago to disable a second VPN IPsec tunnel (one that will be needed for OCP prod in AWS), and it now seems "stable".

We still don't know (we had other fires to fight this Friday) whether the underlying issue is network-related (and temporary) or due to the second IPsec tunnel (which had been working fine since Monday, so for some time before the first errors appeared).

So we now have two options:

  • re-enable the second tunnel on the same IPsec host and see how that goes
  • move the second tunnel's IPsec endpoint to a different node and adapt the routing tables distributed in the RDU2C CI network to use that node (the preferred solution, even if it means quickly deploying a separate VM for it)

Oh well, I don't want to ruin anyone's Friday evening, but it just happened again:

ssh: Could not resolve hostname n27-18-81.pool.ci.centos.org: Name or service not known
2022-11-25 17:46:54,749 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed
ssh: Could not resolve hostname n27-26-130.pool.ci.centos.org: Name or service not known
2022-11-25 17:47:26,839 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed
ssh: Could not resolve hostname n27-25-177.pool.ci.centos.org: Name or service not known
2022-11-25 17:47:06,493 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed

etc.

And, as before, I can't access Duffy from the slave01 node either:

[Prod - slave01.ci.centos.org]
(duffy) [systemd@slave01 ~]$ duffy client --format=flat list-sessions    
Traceback (most recent call last):
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/ssl.py", line 1226, in recv
    return self.read(buflen)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/ssl.py", line 1101, in read
    return self._sslobj.read(len)
socket.timeout: The read operation timed out
...

  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out

Yes, I can confirm that from some network tests running through the VPN tunnel... :/
What I did this Saturday morning was to simply revert to the previous kernel (the one that was running fine before the 2nd tunnel was added and the VPN node was restarted), so let's see how that goes.
I also added the 2nd tunnel back, as it clearly wasn't the issue.

Let me know in the next few days how that looks for you.

A couple more occurrences from today (from cico-workspace OCP pods):

# 13:06 UTC
ssh: Could not resolve hostname n27-37-102.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-30-164.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-15-59.pool.ci.centos.org: Name or service not known
# 14:17 UTC
ssh: Could not resolve hostname n27-5-154.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-10-82.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-20-125.pool.ci.centos.org: Name or service not known

Also (from a Duffy metal node):

# 15:13 UTC
Could not resolve host: artifacts.ci.centos.org

I was testing again, with some ping loops from both sides of the only running tunnel:

40013 packets transmitted, 40002 received, 0.0274911% packet loss, time 40067681ms

and from the other side, for a different target:

40014 packets transmitted, 40006 received, +7 errors, 0% packet loss, time 40066768ms

That doesn't seem entirely stable, so let's continue to investigate.
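
(For the record, the loss percentages in summaries like the ones above follow directly from the transmitted/received counts; a small, purely illustrative sketch that recomputes them from the standard ping summary line:)

# Recompute packet loss from a standard ping summary line, e.g.
# "40013 packets transmitted, 40002 received, 0.0274911% packet loss, ..."
# Purely illustrative; it just cross-checks the percentages quoted above.
import re

def packet_loss(summary: str) -> float:
    match = re.search(r"(\d+) packets transmitted, (\d+) received", summary)
    if not match:
        raise ValueError("not a ping summary line")
    transmitted, received = map(int, match.groups())
    return 100.0 * (transmitted - received) / transmitted

print(packet_loss("40013 packets transmitted, 40002 received, "
                  "0.0274911% packet loss, time 40067681ms"))
# -> ~0.027, matching the figure quoted above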

Some other stats; for a VPN tunnel, that seems correct:

80156 packets transmitted, 80147 received, 0.0112281% packet loss, time 80266810ms

I'd consider this one working now, but while troubleshooting that particular VPN I had to establish the second AWS tunnel, which is also working fine but with some enhancements, so I'll probably apply these (should be really fast) to the first tunnel as well. Let me close this one for now.

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

2 years ago
