It would be great if we could remove failing or degraded servers from the main round-robin.
For example, the getfedora.org destination, and basically everything this pool of servers is handling, has a 20-30% failure rate as of this moment.
When using Cloudflare WARP over IPv6 via Comcast in Denver, CO, US:
time while true; do for dest in $(dig +short -6 -t AAAA getfedora.org); do curl -m 5 -so /dev/null --resolve getfedora.org:443:$dest https://getfedora.org/ && echo -n '.'|| echo "$dest failed to respond"; done; done
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....2620:52:3:1:dead:beef:cafe:fed7 failed to respond
(identical output repeats on every pass until interrupted)
2600:1f14:fad:5c02:3556:f9be:1874:bdc0 failed to respond
....^C
real 2m26.944s
user 0m3.296s
sys 0m0.570s
When using Comcast IPv4 in Denver, CO, US:
$ time while true; do for ipv4 in $(dig +short -4 -t A getfedora.org); do curl -m 5 -so /dev/null --resolve "getfedora.org:443:$ipv4" https://getfedora.org && echo -n '.' || echo "failed $ipv4"; done; done
failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
.failed 35.81.0.97
.......failed 8.43.85.67
(identical output repeats on every pass until interrupted)
.failed 35.81.0.97
.......failed 8.43.85.67
.^C
real 4m51.716s
user 0m9.983s
sys 0m1.686s
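To turn runs like the ones above into a per-address failure rate, something along these lines might help. It is only a sketch: the function names `probe_once` and `summarise_probes` are made up for illustration, the curl flags are copied from the loops above, and the percentage is truncated to a whole number.

```shell
#!/bin/sh
# Summarise probe results per address.
# Input on stdin: one line per probe, "<address> ok" or "<address> fail".
# Output: "<address> <failures>/<total> <percent>%", sorted by address.
summarise_probes() {
    awk '
        { total[$1]++; if ($2 == "fail") failed[$1]++ }
        END {
            for (addr in total)
                printf "%s %d/%d %d%%\n", addr, failed[addr], total[addr],
                       (failed[addr] * 100) / total[addr]
        }
    ' | sort
}

# Probe every address of a host once and emit lines for summarise_probes.
# $1 is the host name, $2 the record type (A or AAAA). Needs network access.
probe_once() {
    host=$1
    rrtype=$2
    for addr in $(dig +short -t "$rrtype" "$host"); do
        if curl -m 5 -so /dev/null --resolve "$host:443:$addr" "https://$host/"; then
            echo "$addr ok"
        else
            echo "$addr fail"
        fi
    done
}

# Example (not run here):
#   for i in 1 2 3 4 5; do probe_once getfedora.org A; done | summarise_probes
```

Keeping the probing and the counting separate means saved output from an earlier run can be summarised again without touching the network.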
Yes, it would be. Do you have time to help work out how this could be accomplished?
My guess is that the DNS system would need to be overhauled so that it can be triggered from the monitoring solution (currently Nagios).
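As a rough illustration of the idea, the step between monitoring and DNS could be a filter that keeps only the addresses passing a health check. This is a sketch under assumptions, not how the Fedora DNS setup works: `healthy_addrs` and `check_https` are hypothetical names, and how the resulting list gets pushed into the zones is left open. If every address fails, it returns the full list rather than an empty pool, since a monitoring glitch should not remove all the records.

```shell
#!/bin/sh
# Print only the addresses for which the check command succeeds.
# $1 is the name of a command or function that takes one address;
# the remaining arguments are the candidate addresses.
# If no address passes, print all of them (fail open).
healthy_addrs() {
    check=$1
    shift
    good=""
    for addr in "$@"; do
        if "$check" "$addr" >/dev/null 2>&1; then
            good="$good$addr
"
        fi
    done
    if [ -n "$good" ]; then
        printf '%s' "$good"
    elif [ "$#" -gt 0 ]; then
        printf '%s\n' "$@"
    fi
}

# Hypothetical check: same curl probe as in the report above.
check_https() {
    curl -m 5 -so /dev/null --resolve "getfedora.org:443:$1" https://getfedora.org/
}

# Example (not run here, needs network access):
#   healthy_addrs check_https $(dig +short -t A getfedora.org)
```

Nagios can run an event handler when a service changes state, so a filter like this could in principle be called from there, but that wiring is not shown.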
Going through the Nagios logs, 8.43.85.67/2620:52:3:1:dead:beef:cafe:fed7 was down due to httpd starvation and needed an httpd restart.
Proxy09 (35.81.0.97/2600:1f14:fad:5c02:7c8a:72d0:1c58:c189) never alerted and was completely reachable from the two Nagios systems around the time you opened the ticket. I do not know what would have caused that problem.
This should all be fixed now.
The timing was unfortunate here, as I had just gone to sleep right before this started happening and apparently no sysadmin-web folks were around overnight. ;(
Proxy09 had the wrong (old) IP in the getfedora and other zones. That was my fault: when I reprovisioned it, I updated only fedoraproject.org and not the other zones. ;(
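A quick consistency check between zones might catch this kind of drift after a reprovision. The sketch below assumes that the two names being compared are supposed to resolve to the same set of proxies, which may not hold for every zone; `missing_from` and `compare_zones` are hypothetical helper names.

```shell
#!/bin/sh
# Print addresses present in the first list but missing from the second.
# Both arguments are newline-separated address lists.
missing_from() {
    [ -n "$1" ] || return 0
    # -F: fixed strings, -x: whole-line match, -v: keep non-matching lines
    printf '%s\n' "$1" | grep -Fxv -e "$2" || true
}

# Compare the A records of two names (needs network access).
# Note: dig +short also prints CNAME targets if the name is an alias.
compare_zones() {
    ref=$(dig +short -t A "$1")
    other=$(dig +short -t A "$2")
    echo "in $1 but not in $2:"
    missing_from "$ref" "$other"
    echo "in $2 but not in $1:"
    missing_from "$other" "$ref"
}

# Example (not run here):
#   compare_zones fedoraproject.org getfedora.org
```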
As to the root cause, the httpd log shows:
[Sat Jan 21 06:01:28.081881 2023] [mpm_event:alert] [pid 2099954:tid 2099954] AH02324: A resource shortage or other unrecoverable failure was encountered before any child process initialized successfully... httpd is exiting!
but I can't seem to find anything else. ;(
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)