#11097 wildcard.fedoraproject.org Health Checking
Closed: Fixed a year ago by kevin. Opened a year ago by jsteffan.

It would be great if we could remove failing or degraded servers from the main round-robin.

For example, the getfedora.org destination, and basically everything this pool of servers is handling, has a 20-30% failure rate as of this moment.

When using Cloudflare WARP over IPv6 via Comcast in Denver, CO, US:

$ time while true; do for dest in $(dig +short -6 -t AAAA getfedora.org); do curl -m 5 -so /dev/null --resolve "getfedora.org:443:$dest" https://getfedora.org/ && echo -n '.' || echo "$dest failed to respond"; done; done
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....^C

real    2m26.944s
user    0m3.296s
sys     0m0.570s

When using Comcast IPv4 in Denver, CO, US:

$ time while true; do for ipv4 in $(dig +short -4 -t A getfedora.org); do curl -m 5 -so /dev/null --resolve "getfedora.org:443:$ipv4" https://getfedora.org && echo -n '.' || echo "failed $ipv4"; done; done
failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.^C

real    4m51.716s
user    0m9.983s
sys     0m1.686s
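
The loops above run until interrupted and leave the failure rate to be read off the dots. A bounded variant that tallies failures per address may be easier to compare between runs. This is only a sketch: the function names `probe_pool` and `summarize` are made up for illustration, and the 5-second timeout is carried over from the commands above.

```shell
#!/bin/sh
# summarize: read "<address> ok|fail" lines on stdin and print
# "<address> <failures>/<attempts>" for each address, sorted.
summarize() {
    awk '{ total[$1]++; if ($2 == "fail") failed[$1]++ }
         END { for (a in total) printf "%s %d/%d\n", a, failed[a] + 0, total[a] }' |
        LC_ALL=C sort
}

# probe_pool HOST RECORD_TYPE ROUNDS
# Resolve HOST, request https://HOST/ from every returned address ROUNDS
# times, and print the per-address failure counts.
# The body runs in a subshell so its variables do not leak into the caller.
probe_pool() (
    host=$1 rtype=$2 rounds=$3
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for addr in $(dig +short -t "$rtype" "$host"); do
            if curl -m 5 -so /dev/null --resolve "$host:443:$addr" "https://$host/"; then
                echo "$addr ok"
            else
                echo "$addr fail"
            fi
        done
        i=$((i + 1))
    done | summarize
)

# Example (performs real DNS lookups and HTTPS requests):
#   probe_pool getfedora.org A 10
#   probe_pool getfedora.org AAAA 10
```

An address that fails every attempt points at a dead or misconfigured proxy, while a partial failure count suggests an overloaded one.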

Yes it would be. Do you have time to help work on how this would be accomplished?

My guess is that the DNS system would need to be overhauled so that it can be triggered from the monitoring solution (currently Nagios).
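
One possible shape for the monitoring side of that, as a rough sketch only: given the full list of proxy addresses, print just the ones that pass a health check, so that whatever regenerates the zone can publish only those. The names `check_https` and `filter_healthy` are hypothetical, the fail-open behaviour (keep every address if none pass) is an assumption rather than an agreed design, and how the resulting list would reach DNS is left open.

```shell
#!/bin/sh
# check_https HOST ADDR: succeed if ADDR answers HTTPS for HOST within 5 seconds.
check_https() {
    curl -m 5 -so /dev/null --resolve "$1:443:$2" "https://$1/"
}

# filter_healthy CHECK HOST ADDR...
# Print, one per line, the addresses for which "CHECK HOST ADDR" succeeds.
# If no address passes, print all of them, so that a broken check or a
# monitoring-side outage cannot empty the pool (assumed policy).
filter_healthy() (
    check=$1 host=$2
    shift 2
    healthy=''
    for addr in "$@"; do
        if "$check" "$host" "$addr"; then
            healthy="$healthy$addr
"
        fi
    done
    if [ -n "$healthy" ]; then
        printf '%s' "$healthy"
    elif [ "$#" -gt 0 ]; then
        printf '%s\n' "$@"
    fi
)

# Example (performs real HTTPS requests):
#   filter_healthy check_https getfedora.org 8.43.85.67 35.81.0.97
```

Passing the check as an argument keeps the filtering logic separate from the probe, so an existing Nagios check result could be substituted for the `curl` call.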

Going through the Nagios logs, 8.43.85.67/2620:52:3:1:dead:beef:cafe:fed7 was down due to httpd starvation and needed an httpd restart.

Proxy09 (35.81.0.97/2600:1f14:fad:5c02:7c8a:72d0:1c58:c189) never alerted and was completely reachable from the two Nagios systems around the time you opened the ticket. I do not know what would have caused that problem.

This should all be fixed now.

Timing was unfortunate here, as I had just gone to sleep right before this started happening and apparently no sysadmin-web folks were around overnight. ;(

Proxy09 had the wrong (old) IP in the getfedora and other zones. That was my fault: when I reprovisioned it, I updated only fedoraproject.org and not the other zones. ;(
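
A leftover address like this can be spotted by comparing what two zones return. The sketch below assumes both names are meant to resolve to the same proxy pool; `stale_records` is a hypothetical helper, not an existing tool.

```shell
#!/bin/sh
# stale_records REFERENCE OTHER
# Both arguments are newline-separated address lists. Print every address
# that appears in OTHER but not in REFERENCE.
stale_records() {
    # Flatten the reference list to one space-separated line for awk.
    ref=$(printf '%s\n' "$1" | tr '\n' ' ')
    printf '%s\n' "$2" | awk -v ref="$ref" '
        BEGIN { n = split(ref, r, " "); for (i = 1; i <= n; i++) known[r[i]] = 1 }
        NF && !($0 in known)'
}

# Example (performs real DNS lookups):
#   stale_records "$(dig +short -t A fedoraproject.org)" \
#                 "$(dig +short -t A getfedora.org)"
#   stale_records "$(dig +short -t AAAA fedoraproject.org)" \
#                 "$(dig +short -t AAAA getfedora.org)"
```

Any output is an address that the second zone still publishes but the reference zone does not, which is worth checking after a reprovision.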

As to root cause:

[Sat Jan 21 06:01:28.081881 2023] [mpm_event:alert] [pid 2099954:tid 2099954] AH02324: A resource shortage or other unrecoverable failure was encountered before any child process initialized successfully... httpd is exiting!

but I can't seem to find anything else. ;(

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago
