#797 Substantial packetloss to the CentOS CI OCP instance
Closed: Fixed with Explanation 2 years ago by arrfab. Opened 2 years ago by mrc0mmand.

Since about half an hour ago I basically can't load anything from the CentOS CI OCP cluster (https://console-openshift-console.apps.ocp.ci.centos.org/); everything takes a ridiculous amount of time to load. mtr shows that something is wrong with the network near the cluster:

From localhost:

...
    Host                                              Loss%   Snt   Last   Avg  Best  Wrst StDev
 8. viawest-svc073699-ic361683.ip.twelve99-cust.net       0.0%   525   99.6  99.8  99.5 110.7   0.6
 9. be23.bbrt01.iad01.flexential.net                      0.0%   525  107.7 107.8 107.5 110.2   0.2
10. be185.bbrt02.ral01.flexential.net                     0.0%   525  108.0 108.1 107.7 119.9   0.5
11. be32.crrt02.ral01.flexential.net                      0.0%   525  110.0 110.1 109.9 117.3   0.5
12. 128.136.224.140                                       0.2%   525  108.2 112.1 107.9 148.2   8.2
13. 8.43.84.1                                             0.0%   525  133.7 152.2 122.6 320.6  27.8
14. 8.43.84.3                                             8.0%   525  110.7 112.1 108.2 170.4   8.4
15. 8.43.84.4                                             0.0%   525  127.5 133.4 112.5 293.7  18.1
16. 8.43.84.254                                           3.4%   525  737.2 589.3 110.6 2369. 548.0
17. 8.43.84.248                                          73.5%   525  110.1 115.3 110.0 152.0  10.0

From one of the test machines provided by Fedora:

...
    Host                                      Loss%   Snt   Last   Avg  Best  Wrst StDev
29. be-133-pe01.seattle.wa.ibone.comcast.net     0.0%   163    9.7   9.8   9.5  10.9   0.2
30. 23.30.206.10                                 0.0%   163    9.7   9.7   9.4  10.5   0.2
31. be21.bbrt01.sea01.flexential.net             0.0%   163   81.4  81.5  81.2  83.0   0.2
32. be101.bbrt01.pdx01.flexential.net            0.0%   163   81.7  81.6  81.3  82.0   0.2
33. be10.bbrt02.pdx01.flexential.net             0.0%   163   81.4  81.2  80.9  81.7   0.1
34. be198.bbrt02.msp01.flexential.net            0.0%   163   81.5  81.5  81.2  81.9   0.1
35. be173.bbrt02.msp10.flexential.net            0.0%   163   81.2  81.3  81.0  81.7   0.1
36. be10.bbrt01.msp10.flexential.net             0.0%   163   81.3  81.5  81.3  82.6   0.2
37. be188.bbrt02.chi01.flexential.net            0.0%   163   81.3  81.5  81.2  81.9   0.1
38. be10.bbrt01.chi10.flexential.net             0.0%   163   81.6  81.5  81.3  81.9   0.1
39. be204.bbrt02.cin01.flexential.net            0.0%   163   81.8  81.5  81.2  81.9   0.1
40. be192.bbrt02.iad01.flexential.net            0.0%   162   81.4  81.4  81.1  82.1   0.2
41. be10.bbrt01.iad01.flexential.net             0.0%   162   81.7  81.4  81.3  81.9   0.1
42. be185.bbrt02.ral01.flexential.net            0.0%   162   81.4  81.4  81.2  82.3   0.2
43. be32.crrt02.ral01.flexential.net             0.0%   162   81.5  81.6  81.4  83.6   0.2
44. 128.136.224.140                              0.0%   162   83.0  87.9  81.5 130.6  10.9
45. 8.43.84.1                                    0.0%   162  103.4 121.8  96.1 223.7  16.9
46. 8.43.84.3                                   20.4%   162   81.8  88.9  81.7 121.4  11.2
47. 8.43.84.4                                    0.0%   162   95.0 107.6  86.1 212.6  20.1
48. 8.43.84.254                                  1.9%   162   84.0 603.3  82.4 1942. 528.3
49. 8.43.84.248                                 77.0%   162   82.1  88.0  82.0 117.4  10.1
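For reference, a report like the traces above can be produced with mtr's report mode (the cycle count here is an assumption chosen to match the Snt column in the first trace; ICMP probing typically needs root or cap_net_raw):

```shell
# Wide report mode, 525 probe cycles, against the cluster console endpoint
mtr --report --report-wide --report-cycles 525 console-openshift-console.apps.ocp.ci.centos.org
```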

Metadata Update from @arrfab:
- Issue assigned to arrfab

2 years ago

Metadata Update from @arrfab:
- Issue priority set to: 🔥 Urgent 🔥 (was: Needs Review)
- Issue tagged with: centos-ci-infra, high-gain, medium-trouble

2 years ago

Just commenting on this today (Saturday), but the issue was actually resolved Friday evening.
It's all fixed, as confirmed on the #centos-ci channel on libera.chat.

Explanation: a switch port reconfiguration was done to move part of the CI infra, but it seems it confused the upstream switch for some seconds. While that part was fixed directly (basically, two interfaces of a host acting as hypervisor had ended up in the same bridge+vlan), it had confused other parts of the network. Everything was then OK except for the two gateways, which use keepalived and had suddenly entered "split brain" mode, so a reboot of these nodes (one at a time) solved the issue.
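For context on the split-brain failure mode: a gateway pair like this is typically a keepalived VRRP MASTER/BACKUP setup, where the backup claims the virtual IP only when it stops seeing the master's advertisements. A transient switch confusion that drops those advertisements can leave both nodes believing they are MASTER. A minimal sketch of such a pair (interface names, router ID, VIP and password are hypothetical, not taken from the CI infra):

```
# /etc/keepalived/keepalived.conf on the primary gateway (sketch)
vrrp_instance GW_VIP {
    state MASTER            # the peer node uses "state BACKUP"
    interface eth0
    virtual_router_id 51    # must match on both nodes
    priority 150            # peer uses a lower priority, e.g. 100
    advert_int 1            # advertisement interval in seconds
    authentication {
        auth_type PASS
        auth_pass changeme  # hypothetical shared secret
    }
    virtual_ipaddress {
        192.0.2.254/24      # hypothetical gateway VIP
    }
}
```

If advertisements are lost on the network between the two nodes, the BACKUP promotes itself while the MASTER keeps the VIP too, which matches the described symptom; serially rebooting the nodes forces a clean re-election.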

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

2 years ago

