Since yesterday I've noticed that sometimes the machines provided by CentOS CI "disappear" - i.e. a machine is no longer reachable and ssh-ing to it ends up with "Name or service not known".
The first occurrence (given the logs I have at hand) was on 2022-11-22 ~23:35 CET:
ssh: Could not resolve hostname n27-18-187.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-24-219.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-28-155.pool.ci.centos.org: Name or service not known
Another one happened today (2022-11-23) around 15:50 CET:
ssh: Could not resolve hostname n27-15-186.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-33-104.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-32-227.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-12-43.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-12-99.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-34-239.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-44-94.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-29-55.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-24-180.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-40-14.pool.ci.centos.org: Name or service not known
It seems to happen in all pools we use (i.e. both metal and T2 ones, C8S and C9S).
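A quick sanity check to confirm it really is a DNS resolution failure (rather than, say, a routing problem) - just a minimal sketch, with one of the hostnames from the logs above as an example:

# Resolve a pool node; getent uses the same resolver path as ssh/ping
HOST=n27-15-186.pool.ci.centos.org
getent hosts "$HOST" || echo "resolution of $HOST failed"
# Compare with a name outside the pool.ci.centos.org sub-domain
getent hosts duffy.ci.centos.org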
Another interesting point: after the last "event" the respective Duffy sessions were left behind, which means that the pods couldn't reach the API endpoints. Also, there were 11 sessions attached to our tenant instead of 10 (which is the current quota):
$ duffy client --format=flat list-sessions
session_id=149606 active=TRUE created_at='2022-11-23 14:54:55.576215+00:00' retired_at= pool='virt-ec2-t2-centos-9s-x86_64' hostname='n27-22-200.pool.ci.centos.org' ipaddr='172.27.22.200'
session_id=149561 active=TRUE created_at='2022-11-23 14:52:20.960602+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-9-193.pool.ci.centos.org' ipaddr='172.27.9.193'
session_id=149576 active=TRUE created_at='2022-11-23 14:54:18.606937+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-3-7.pool.ci.centos.org' ipaddr='172.27.3.7'
session_id=149580 active=TRUE created_at='2022-11-23 14:54:21.863244+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-31-219.pool.ci.centos.org' ipaddr='172.27.31.219'
session_id=149582 active=TRUE created_at='2022-11-23 14:54:23.765660+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-3-0.pool.ci.centos.org' ipaddr='172.27.3.0'
session_id=149588 active=TRUE created_at='2022-11-23 14:54:26.006745+00:00' retired_at= pool='virt-ec2-t2-centos-9s-x86_64' hostname='n27-14-180.pool.ci.centos.org' ipaddr='172.27.14.180'
session_id=149590 active=TRUE created_at='2022-11-23 14:54:26.726718+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-40-128.pool.ci.centos.org' ipaddr='172.27.40.128'
session_id=149594 active=TRUE created_at='2022-11-23 14:54:30.849463+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-46-222.pool.ci.centos.org' ipaddr='172.27.46.222'
session_id=149596 active=TRUE created_at='2022-11-23 14:54:31.943544+00:00' retired_at= pool='virt-ec2-t2-centos-8s-x86_64' hostname='n27-11-240.pool.ci.centos.org' ipaddr='172.27.11.240'
session_id=149603 active=TRUE created_at='2022-11-23 14:54:53.117909+00:00' retired_at= pool='virt-ec2-t2-centos-9s-x86_64' hostname='n27-47-210.pool.ci.centos.org' ipaddr='172.27.47.210'
session_id=149608 active=TRUE created_at='2022-11-23 14:54:59.543202+00:00' retired_at= pool='virt-ec2-t2-centos-9s-x86_64' hostname='n27-11-8.pool.ci.centos.org' ipaddr='172.27.11.8'
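In case it helps, a sketch of how the leftover sessions could be cleaned up once the API is reachable again - this assumes the duffy client's retire-session subcommand and the flat output format shown above:

# Retire every session currently attached to the tenant (use with care,
# this also kills sessions backing still-running jobs)
for id in $(duffy client --format=flat list-sessions | grep -o 'session_id=[0-9]*' | cut -d= -f2); do
    duffy client retire-session "$id"
done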
More occurrences from today:
05:52 CET:
[TASK] TEST-38-FREEZER (/bin/time -v -- make -C test/TEST-38-FREEZER setup run && touch $TESTDIR/pass; rm -fv $TESTDIR/*.img; test -e $TESTDIR/pass) [try 1/2]
[RESULT] TEST-38-FREEZER - PASS (log file: /root/testsuite-logs-upstream.Hvf/TEST-38-FREEZER_1_PASS.log)
[TASK] TEST-39-EXECRELOAD (/bin/time -v -- make -C test/TEST-39-EXECRELOAD setup run && touch $TESTDIR/pass; rm -fv $TESTDIR/*.img; test -e $TESTDIR/pass) [try 1/2]
2022-11-24 04:47:47,650 [agent-control/fetch_artifacts] INFO: Fetching artifacts from node n27-9-144.pool.ci.centos.org: (remote: ~/testsuite-logs*, local: ./artifacts_3yrmtuu_)
2022-11-24 04:47:47,650 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/scp -r -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@n27-9-144.pool.ci.centos.org:~/testsuite-logs* ./artifacts_3yrmtuu_
ssh: Could not resolve hostname n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:47:53,673 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed
2022-11-24 04:47:53,673 [agent-control/main] INFO: Trying to collect kernel dumps from n27-9-144.pool.ci.centos.org:/var/crash
2022-11-24 04:48:03,676 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #0)
2022-11-24 04:48:03,677 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:48:24,707 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #1)
2022-11-24 04:48:24,707 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:48:45,737 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #2)
2022-11-24 04:48:45,738 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:49:06,771 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #3)
2022-11-24 04:49:06,771 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:49:27,788 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #4)
2022-11-24 04:49:27,789 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:49:48,820 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #5)
2022-11-24 04:49:48,820 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:50:09,851 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #6)
2022-11-24 04:50:09,851 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:50:30,872 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #7)
2022-11-24 04:50:30,872 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:50:51,903 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #8)
2022-11-24 04:50:51,903 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:51:12,934 [agent-control/wait_for_node] INFO: Checking if the node n27-9-144.pool.ci.centos.org is alive (try #9)
2022-11-24 04:51:12,934 [agent-control/execute_local_command] INFO: Executing a LOCAL command: /usr/bin/ping -q -c 1 -W 10 n27-9-144.pool.ci.centos.org
ping: n27-9-144.pool.ci.centos.org: Name or service not known
2022-11-24 04:51:33,964 [agent-control/main] ERROR: Execution failed
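For context, the polling above is just agent-control repeatedly pinging the node until it gives up; a rough sketch of that pattern (not the actual agent-control code; node name and retry delay taken from the log above) would be:

NODE=n27-9-144.pool.ci.centos.org
for try in $(seq 0 9); do
    echo "Checking if the node $NODE is alive (try #$try)"
    ping -q -c 1 -W 10 "$NODE" && break
    sleep 20   # approximate delay between tries seen in the log
done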
09:39 - 09:43:
ssh: Could not resolve hostname n27-0-85.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-45-53.pool.ci.centos.org: Name or service not known
I can't even reach Duffy from the slave01 node:
[Prod - slave01.ci.centos.org] (duffy) [systemd@slave01 ~]$ duffy client --format=flat list-sessions
Traceback (most recent call last):
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/ssl.py", line 1226, in recv
    return self.read(buflen)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/ssl.py", line 1101, in read
    return self._sslobj.read(len)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
    raise exc
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
    response = connection.handle_request(request)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/connection.py", line 90, in handle_request
    return self._connection.handle_request(request)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/http11.py", line 105, in handle_request
    raise exc
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/http11.py", line 84, in handle_request
    ) = self._receive_response_headers(**kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/http11.py", line 148, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_sync/http11.py", line 177, in _receive_event
    data = self._network_stream.read(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc)
httpcore.ReadTimeout: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/systemd/venv/duffy/bin/duffy", line 8, in <module>
    sys.exit(cli())
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/click/decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/duffy/cli.py", line 670, in client_list_sessions
    result = obj["client"].list_sessions()
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/duffy/client/main.py", line 109, in list_sessions
    return self._query_method(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/duffy/client/main.py", line 94, in _query_method
    response = client_method(url=url, **add_kwargs)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 1039, in get
    return self.request(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 815, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 902, in send
    response = self._send_handling_auth(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 930, in _send_handling_auth
    response = self._send_handling_redirects(
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 967, in _send_handling_redirects
    response = self._send_single_request(request)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_client.py", line 1003, in _send_single_request
    response = transport.handle_request(request)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out
nor from the pods:
00:00:23.659 2022-11-24 08:49:44,791 [agent-control/allocate_node] INFO: Attempting to allocate a node from pool metal-ec2-c5n-centos-8s-x86_64
00:02:19.014 2022-11-24 08:51:40,145 [agent-control/allocate_node] ERROR: [Try 12/1440] Received an API error from the server: Timeout while requesting URL('https://duffy.ci.centos.org/api/v1/sessions')
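A quick way to tell a DNS failure apart from a plain API timeout from the same pod would be something along these lines (just a sketch; authentication is omitted, so the request is only expected to come back with some HTTP status quickly rather than useful data):

# Does the API hostname resolve internally?
getent hosts duffy.ci.centos.org
# Does the endpoint from the error above answer at all within 30 seconds?
curl -sS -m 30 -o /dev/null -w '%{http_code}\n' https://duffy.ci.centos.org/api/v1/sessions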
And trying to ping an AWS node from slave01 yields the same error:
[Prod - slave01.ci.centos.org] (duffy) [systemd@slave01 ~]$ ping n27-45-53.pool.ci.centos.org
ping: n27-45-53.pool.ci.centos.org: Name or service not known
Metadata Update from @arrfab:
- Issue assigned to arrfab
- Issue tagged with: centos-ci-infra, high-gain, high-trouble
(Also discussed on IRC in #centos-ci on irc.libera.chat, but giving a summary here):
From Zabbix/monitoring logs I can confirm that we had issues related to the VPN itself going down and then up again (between the RDU2C DC and the remote AWS VPCs).
The "Name or service not known" issue (when dealing with nodes in the pool.ci.centos.org sub-domain) is due to that: it's a remote (so reached through the VPN) private zone hosted in AWS/Route53. So when the tunnel goes down, the local resolvers, which forward to Route53, aren't able to resolve it.
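A simple way to see that it's the forwarding path that breaks (sketch only; the node name is just one taken from the logs above):

# Forwarded to the Route53 private zone over the VPN - fails while the tunnel is down
dig +short n27-45-53.pool.ci.centos.org
# Resolved internally, independent of the tunnel - should keep working
dig +short duffy.ci.centos.org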
The other issue I see is interesting though, and should not be linked to the external network: duffy client --format=flat list-sessions is internal (and duffy.ci.centos.org resolves internally too), so the Duffy API should be able to answer internal queries without timeouts (but that's something @nphilipp can confirm). I would understand it if it happened during a duffy client request-session, as that would have to reach the remote EC2 node for contextualization and would fail, or (still to be investigated) it's something related to internal network issues, which would then explain both the VPN and the internal endpoint issue.
Just an update: after some verification, we decided ~24h ago to disable a second VPN IPsec tunnel (one that will be needed for OCP prod in AWS), and it now seems "stable".
We still don't know (we had other fires to fight this Friday) if the underlying issue is network-related (and temporary) or due to the second IPsec tunnel (which had been working fine since Monday, so for some time before the first errors appeared).
So we now have two options:
Oh well, I don't want to ruin the Friday evening, but it just happened again:
ssh: Could not resolve hostname n27-18-81.pool.ci.centos.org: Name or service not known
2022-11-25 17:46:54,749 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed
ssh: Could not resolve hostname n27-26-130.pool.ci.centos.org: Name or service not known
2022-11-25 17:47:26,839 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed
ssh: Could not resolve hostname n27-25-177.pool.ci.centos.org: Name or service not known
2022-11-25 17:47:06,493 [agent-control/execute_remote_command] WARNING: Fetching artifacts failed
etc.
And, as before, I can't access Duffy from the slave01 node either:
[Prod - slave01.ci.centos.org] (duffy) [systemd@slave01 ~]$ duffy client --format=flat list-sessions
Traceback (most recent call last):
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/ssl.py", line 1226, in recv
    return self.read(buflen)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/ssl.py", line 1101, in read
    return self._sslobj.read(len)
socket.timeout: The read operation timed out
...
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/systemd/venv/duffy/lib64/python3.8/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out
Yes, I can confirm that through some network tests running through the VPN tunnel .. :/ What I did this Saturday morning was to just revert to the previous kernel (the one that was running fine before the 2nd tunnel was added and the VPN node was restarted), and let's see how that goes. I also added the 2nd tunnel back, as clearly it wasn't the issue.
Let me know in the next few days how that looks for you.
A couple more occurrences from today (from cico-workspace OCP pods):
# 13:06 UTC
ssh: Could not resolve hostname n27-37-102.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-30-164.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-15-59.pool.ci.centos.org: Name or service not known

# 14:17 UTC
ssh: Could not resolve hostname n27-5-154.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-10-82.pool.ci.centos.org: Name or service not known
ssh: Could not resolve hostname n27-20-125.pool.ci.centos.org: Name or service not known
Also (from a Duffy metal node):
# 15:13 UTC
Could not resolve host: artifacts.ci.centos.org
I was testing again, with some ping loops from both sides of the only running tunnel:
40013 packets transmitted, 40002 received, 0.0274911% packet loss, time 40067681ms
and from the other side, for a different target:
40014 packets transmitted, 40006 received, +7 errors, 0% packet loss, time 40066768ms
That seems not "pretty" stable so let's continue to investigate
Some other stats, and for a VPN tunnel that seems correct:
80156 packets transmitted, 80147 received, 0.0112281% packet loss, time 80266810ms
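For the record, these numbers come from long-running ping loops across the tunnel; roughly something like the following sketch (the target address is a placeholder, not the real tunnel endpoint, and the count/interval are just inferred from the totals above):

PEER=172.27.0.1   # placeholder for a host on the remote side of the tunnel
# ~80000 probes at the default 1s interval, quiet output with only the summary at the end
ping -q -c 80000 "$PEER"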
I'd consider this one to be working now, but while troubleshooting that particular VPN I had to establish the second AWS tunnel, which is also working fine but with some enhancements, so I'll probably also apply those (should be really fast) to the first tunnel. Let me close this one for now.
Metadata Update from @arrfab:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)