#9850 TCP dns query failure for fedoraproject.org
Closed: Fixed 3 years ago by smooge. Opened 3 years ago by misc.

Describe what you would like us to do:


Someone came on #fedora-admin to report the following problem:

11:05:12|  Habbie> the core of the problem is this:
11:05:14|  Habbie> $ dig dnskey fedoraproject.org @ns02.fedoraproject.org +dnssec +bufsize=1232
11:05:16|  Habbie> ;; Truncated, retrying in TCP mode.
11:05:18|  Habbie> and then it just sits there
11:05:34|  Habbie> after 30 seconds dig gives up:
11:05:36|  Habbie> ;; connection timed out; no servers could be reached
11:05:38|  Habbie> ;; Connection to 152.19.134.139#53(152.19.134.139) for fedoraproject.org failed: timed out.

The reporter hit this while trying to log in to Pagure, hence me opening the ticket.

I can reproduce the issue with that command from various systems around the world. I checked with tcpdump and saw no ICMP errors. The firewall rules in ansible appear to allow TCP/53.

We suspect this is triggered by #9422 (e.g., having twice as many keys as before), but TCP queries should be supported regardless. To check, the reporter also gave this URL: https://ednscomp.isc.org/ednscomp/7466a4ab46

This indicates that the problem is neither network-specific nor related to an external firewall (because, following best practice in shifting the blame, that's what I tried first).

I am not bind-fluent enough to debug that without access.
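For anyone who wants to reproduce the UDP half of this without dig, here is a minimal sketch of what `+dnssec +bufsize=1232` puts on the wire and how the truncation (TC) bit is detected. The function names are made up for illustration, it only handles an IPv4 server address, and it is not a replacement for dig.

```python
import socket
import struct

TYPE_DNSKEY = 48   # RR type code for DNSKEY
TYPE_OPT = 41      # EDNS0 OPT pseudo-record
CLASS_IN = 1
FLAG_TC = 0x0200   # "truncated" bit in the header flags word
FLAG_DO = 0x8000   # "DNSSEC OK" bit inside the OPT record's TTL field


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype, bufsize=1232, query_id=0x1234):
    """Build one question plus an OPT record advertising bufsize with DO set."""
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, CLASS_IN)
    # OPT: root name, type 41, class = UDP payload size, TTL = flags, empty RDATA
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, bufsize, FLAG_DO, 0)
    return header + question + opt


def is_truncated(response):
    """True if the server set the TC bit, i.e. the client should retry over TCP."""
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & FLAG_TC)


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def query(server_ip, packet, timeout=5.0):
    """Ask over UDP first; on truncation, retry over TCP (2-byte length prefix)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
        udp.settimeout(timeout)
        udp.sendto(packet, (server_ip, 53))
        response, _ = udp.recvfrom(65535)
    if not is_truncated(response):
        return response
    with socket.create_connection((server_ip, 53), timeout=timeout) as tcp:
        tcp.sendall(struct.pack("!H", len(packet)) + packet)
        (length,) = struct.unpack("!H", recv_exact(tcp, 2))
        return recv_exact(tcp, length)
```

Calling `query("152.19.134.139", build_query("fedoraproject.org", TYPE_DNSKEY))` should follow the same path as the dig command in the log: a truncated UDP answer, then a TCP retry, which is the step that was timing out.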

When do you need this to be done by? (YYYY/MM/DD)


Dunno, when you have time I guess


OK, I am not sure what is going on here. TCP DNS is available, but these commands only work on localhost:

[root@ns01 ~][PROD-IAD2]# dig A fedoraproject.org @localhost +dnssec +tcp

; <<>> DiG 9.11.20-RedHat-9.11.20-5.el8_3.1 <<>> A fedoraproject.org @localhost +dnssec +tcp
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 349
;; flags: qr aa rd; QUERY: 1, ANSWER: 16, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
; COOKIE: 84e44c0a7677551e9318c2b2606da4374fca08fd6aaf6d90 (good)
;; QUESTION SECTION:
;fedoraproject.org.             IN      A

;; ANSWER SECTION:
fedoraproject.org.      60      IN      A       38.145.60.20
fedoraproject.org.      60      IN      A       13.244.113.71
fedoraproject.org.      60      IN      A       13.250.126.156
fedoraproject.org.      60      IN      A       152.19.134.142
fedoraproject.org.      60      IN      A       38.145.60.21
fedoraproject.org.      60      IN      A       185.141.165.254
fedoraproject.org.      60      IN      A       18.185.136.17
fedoraproject.org.      60      IN      A       13.125.120.8
fedoraproject.org.      60      IN      A       209.132.190.2
fedoraproject.org.      60      IN      A       8.43.85.67
fedoraproject.org.      60      IN      A       140.211.169.196
fedoraproject.org.      60      IN      A       140.211.169.206
fedoraproject.org.      60      IN      A       152.19.134.198
fedoraproject.org.      60      IN      A       67.219.144.68
fedoraproject.org.      60      IN      RRSIG   A 5 2 60 20210505212747 20210405212747 7725 fedoraproject.org. t1ilU2EI8hmaVJmcvDjWZqKsjN/QcEKRidwwJbbJgC5ni+iZuWMOx/II wyRTsN+4y0SyH1wQGmVaiOQisCY0G4TtTcZ3YtiIduigAHKU7hKZS0Y1 cKG0Yks0AkyjgW/49LT7Yh7IFgOhnIg+xP9kqM0ejRUrIXin1ZjNIQri sA4=
fedoraproject.org.      60      IN      RRSIG   A 14 2 60 20210505212747 20210405212747 60624 fedoraproject.org. E4MJ3fumY6jfvQivuNAly45cBWEbaQiulf3he4eBvfMXgqlDYN2UCIrg isylrbS/oYUcHo5Udtg77OeTQm1tFyzSuVeBgEplwCK/AWRn3s9sd8pM cZ4vyPQJ6ljR6RZa

;; Query time: 0 msec
;; SERVER: ::1#53(::1)
;; WHEN: Wed Apr 07 12:23:19 GMT 2021
;; MSG SIZE  rcvd: 620

Doing this on the same network without a firewall I get

$  dig A fedoraproject.org @ns01.iad2.fedoraproject.org +dnssec +tcp
;; Connection to 10.3.163.33#53(10.3.163.33) for fedoraproject.org failed: timed out.
;; Connection to 10.3.163.33#53(10.3.163.33) for fedoraproject.org failed: timed out.

; <<>> DiG 9.11.20-RedHat-9.11.20-5.el8_3.1 <<>> A fedoraproject.org @ns01.iad2.fedoraproject.org +dnssec +tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
;; Connection to 10.3.163.33#53(10.3.163.33) for fedoraproject.org failed: timed out.

So this is a config issue rather than a firewall one. It also fails on the host itself if I query 10.3.163.33 instead of localhost:

dig dnskey fedoraproject.org @10.3.163.33 +dnssec +bufsize=1232
;; Truncated, retrying in TCP mode.
;; Connection to 10.3.163.33#53(10.3.163.33) for fedoraproject.org failed: timed out.
;; Connection to 10.3.163.33#53(10.3.163.33) for fedoraproject.org failed: timed out.

; <<>> DiG 9.11.20-RedHat-9.11.20-5.el8_3.1 <<>> dnskey fedoraproject.org @10.3.163.33 +dnssec +bufsize=1232
;; global options: +cmd
;; connection timed out; no servers could be reached
;; Connection to 10.3.163.33#53(10.3.163.33) for fedoraproject.org failed: timed out.
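The localhost-versus-real-address comparison above can be scripted. This is a rough sketch with a hypothetical helper name. It separates "the TCP connect itself failed" from "the connection opened but nothing came back", since the kernel can complete a handshake even when named never gets around to reading from the socket.

```python
import socket
import struct


def probe_dns_tcp(host, payload, port=53, timeout=5.0):
    """Classify a DNS-over-TCP attempt.

    Returns "connect-failed", "no-reply", "closed", or "answered".
    payload is a raw DNS message; the 2-byte TCP length prefix is added here.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or the handshake timed out.
        return "connect-failed"
    with sock:
        try:
            sock.sendall(struct.pack("!H", len(payload)) + payload)
            first = sock.recv(2)
        except OSError:
            # Connected, but no data arrived before the timeout.
            return "no-reply"
    # Empty read means the peer closed without answering.
    return "answered" if first else "closed"
```

Running it once against `localhost` and once against `10.3.163.33` with the same payload would show whether the two listeners behave differently. Any bytes coming back count as "answered" here, so it says nothing about whether the reply is a valid DNS response.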

Metadata Update from @smooge:
- Issue assigned to smooge

3 years ago

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: dns, medium-gain, medium-trouble, ops

3 years ago

I'm also seeing problems with fedoraproject.org DNS, I assume same root cause. Symptoms for me are hourly yum-cron emails containing:

Could not get metalink https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64&infra=stock&content=centos error was
14: curl#6 - "Could not resolve host: mirrors.fedoraproject.org; Unknown error"

Whenever I try to manually reproduce/diagnose the issue, the results I see are "random" at best: typically my manual dig queries for mirrors.fedoraproject.org work, and I have run out of hair to pull out. It does "feel" like a DNS packet size issue, and doubling up on DNSSEC keys could well cause that, but the timing doesn't fit any details in #9422. I saw the problem start at 04:00 GMT on 2021-04-05 (Monday).

I'm running my own resolver on CentOS 7, with native IPv6 connectivity, and happy to run any queries that might help narrow down the root cause.

OK, the issue is that our DNS servers are getting overwhelmed by TCP connections. I firewalled off TCP on one nameserver from the outside world except for a single IP address, and from that address TCP connections worked externally.

@nosnilmot the time of 04:00 GMT is suspicious, as that is the thunder buffalo of every CentOS/Fedora/Oracle/etc. system running cron.daily, which usually includes a refresh of the local yum caches.

hmm, at least on the system in question 04:00 localtime is 03:00 GMT, but that may have been the cause for the initial early failure - I didn't keep all emails.

It has been failing hourly for at least the last 36 hours though. I suspect the overload in TCP connections is a result of clients retrying with TCP after failure of the UDP queries, rather than the root problem.

In fedora-dns.git, the TTL in built/AFR/fedoraproject.org.signed is 300 for the vast majority of zone entries. This seems far too short, please consider undoing or amending commit #28dd22376.

The first thing to change is named.conf. This PR will need to be looked at and approved because of the freeze:

https://pagure.io/fedora-infra/ansible/pull-request/532

before the change

# rndc status
WARNING: key file (/etc/rndc.key) exists, but using default configuration file (/etc/rndc.conf)
version: BIND 9.11.20-RedHat-9.11.20-5.el8_3.1 (Extended Support Version) <id:f3d1d66> (cowbell++)
running on ns01.iad2.fedoraproject.org: Linux x86_64 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021
boot time: Wed, 07 Apr 2021 13:10:17 GMT
last configured: Wed, 07 Apr 2021 13:10:19 GMT
configuration file: /etc/named.conf
CPUs found: 2
worker threads: 2
UDP listeners per interface: 1
number of zones: 914 (194 automatic)
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/900/1000
tcp clients: 150/150
TCP high-water: 150
server is up and running

and 10 seconds after the change

# rndc status
WARNING: key file (/etc/rndc.key) exists, but using default configuration file (/etc/rndc.conf)
version: BIND 9.11.20-RedHat-9.11.20-5.el8_3.1 (Extended Support Version) <id:f3d1d66> (cowbell++)
running on ns01.iad2.fedoraproject.org: Linux x86_64 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021
boot time: Wed, 07 Apr 2021 18:01:41 GMT
last configured: Wed, 07 Apr 2021 18:01:43 GMT
configuration file: /etc/named.conf
CPUs found: 2
worker threads: 2
UDP listeners per interface: 1
number of zones: 914 (194 automatic)
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/900/1000
tcp clients: 1000/1000
TCP high-water: 1000
server is up and running

I think we are going to need a bigger server.
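Since the `tcp clients: N/M` line is the tell-tale here, a small check like the following could be pointed at saved `rndc status` output. This is only a sketch: the function names and the threshold parameter are made up, and it assumes the line format shown in the two outputs above.

```python
import re


def tcp_client_usage(status_text):
    """Return (current, limit) from the 'tcp clients: N/M' line, or None if absent."""
    match = re.search(r"^tcp clients:\s*(\d+)/(\d+)\s*$", status_text, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def tcp_clients_saturated(status_text, threshold=1.0):
    """True when current/limit is at or above threshold (1.0 means every slot is used)."""
    usage = tcp_client_usage(status_text)
    if usage is None:
        return False
    current, limit = usage
    return limit > 0 and current / limit >= threshold
```

Both outputs above would be flagged as saturated (150/150 and 1000/1000), which matches the observation that raising the limit alone does not leave any headroom.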

However TCP now returns

$ dig dnskey fedoraproject.org @ns02.fedoraproject.org +dnssec +bufsize=1232
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.el6_10.8 <<>> dnskey fedoraproject.org @ns02.fedoraproject.org +dnssec +bufsize=1232
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63028
;; flags: qr aa rd; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; QUESTION SECTION:
;fedoraproject.org.     IN  DNSKEY

;; ANSWER SECTION:
fedoraproject.org.  300 IN  DNSKEY  256 3 5 AwEAAcCWNQWl5pCI3iOOP2r8nStL60Zjb/2JQLQytamVap0L44z0YWft u7pu0hx3cnIM1ejQOsEwbg2/10IyC+38cYqJDXbSdFg1zGztOS5xNz7r 9hzSRK5N2jkycdJ/BoByJ4Y+XGpDqfG4I97++8sIzSrw60TmGAKTvM9v iL3ByeCN
fedoraproject.org.  300 IN  DNSKEY  257 3 5 AwEAAdTXJc0joiKGfTvLXi+LXxGpKvPvOoJEst9PR8TCCvXGVp7h3BY3 uXLkjckuT0aopCp2KF8zHgNgpMK03p1fd94pn9JZSuxfqvKsiYH2KvNO a/655oPj06jRhqAP5grX01Iz4BH411ZhGxIQ1BzZtOr1wAazojMJzLUg ChRJs8GVt3LU0e6T8z1RQF33Dt9UMHIR5EAsFAqfZ/tsbfJDYktGoZi3 nFlW7A745+ObM1LNXOWq3FcYPVzhH08Q7/7WpxmzM6/ET8VeqWIsvh8E nZNDNMfJyPbY9B1BOIrFCpE03ALgFMejaBZwmeQaX+D4Duup5xGOmdtC O4GSpM1YH6c=
fedoraproject.org.  300 IN  DNSKEY  257 3 14 7ttmhus8JD56ybsvMVZVsXa3U2R+2+WmOPIP7BU6t2LicosMZ2Ju3pfv ijsa5LvBvVCB4xVtLSqEdLSvW4vJPLSAB2uyJwHPJMezh0SzGmVCImLU 6qDxsxjHqtZ76/Sf
fedoraproject.org.  300 IN  DNSKEY  256 3 14 04ZsDOgyzs3kJsJ4jEY3MYufkCOWm1OI8N4M+dlBOBmweln0TSaKfafH zNCkaPiVG4bdgdnrzwxmjpK5GQgsiB47np+I8850Ea3EJG5ORDl3f//l rr92HiYh5DxCNhkG
fedoraproject.org.  300 IN  RRSIG   DNSKEY 5 2 300 20210505212739 20210405212739 7725 fedoraproject.org. GBSZZ0yxVmaTWYAJM1S1ICMoyTETuMQE9fbW1fz91/0WJSPPDNFWhc6p POKHCDa1PEuQPbm3wt745U0F9praIxQUHNeGa//3LELQQ/od5baV2CxZ tqqovbaJD0k9P2VycCYcIxzHQMuzEeFkukZgKMrcxQ8cPC+x3UJ9wFBh gEo=
fedoraproject.org.  300 IN  RRSIG   DNSKEY 5 2 300 20210505212739 20210405212739 16207 fedoraproject.org. PTUxf7vfD6aGCln4FzGb5u+YdL1zSKkx1qcvOIkxv0ihIZHWxCT+uz5+ yVMSzEc9x9/eDCjpPADmeOcD0dEuOR1cdtQKZbq1JVpnY7im4UIqHpFH Bm1Y160L3hRmW58vyRNYawW2SqHPXJ3QmVL2CSsi+jAuB2FTM/ak+abK CZXYrtGGsV3tmOuWRAD8l4egXLtNNL8QHAFOgWpK1Q5zsqYOU2QT2Ti7 V65feEqCYuoDD68s/5mFSnu19UKKzJrbemwHAtUwE58J43GinBill6JM Xhxv1cEPzIcx6NWjV9Hb2XH8J5jLKKnVhfVWmPmGfex8+zbWmmPfyV7a 2pudLA==
fedoraproject.org.  300 IN  RRSIG   DNSKEY 14 2 300 20210505212739 20210405212739 58125 fedoraproject.org. 7fP/+vuQg4aXf3dtSzSD8tbsSJIhbJ2mxaq2RSaogluwBzcCjedZQVUp Cyd9w2WcxyhWN+eF0RxrNZq9XcrLNR9fWStNv37PKCvQcfdjbxuYoKuk BYk5PSf0h861DaTf
fedoraproject.org.  300 IN  RRSIG   DNSKEY 14 2 300 20210505212739 20210405212739 60624 fedoraproject.org. WzBKZU0FfJhar+d2dMpQpaQprbsLuYM1CMaASqzfmWH2SCYWAqXdUg1W DDsdYs8NkWAkaDBbSSHyoBt65Ln05xHl1bWhe3zAoAOSOe5prbcIIBMx SjZ1NfeeojDPDh7j

;; Query time: 39 msec
;; SERVER: 152.19.134.139#53(152.19.134.139)
;; WHEN: Wed Apr  7 14:06:01 2021
;; MSG SIZE  rcvd: 1466
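The `MSG SIZE` line explains why this query always ends up on TCP: the DNSKEY answer is 1466 bytes, which is larger than the 1232-byte buffer advertised with `+bufsize=1232`. A quick sketch for checking that from saved dig output (helper names are hypothetical):

```python
import re


def message_size(dig_output):
    """Extract the byte count from dig's ';; MSG SIZE  rcvd: N' line, or None."""
    match = re.search(r"MSG SIZE\s+rcvd:\s*(\d+)", dig_output)
    return int(match.group(1)) if match else None


def fits_in_udp(size, bufsize=1232):
    """True if a response of this size fits in the advertised EDNS UDP buffer."""
    return size <= bufsize
```

By the same check, the 620-byte A response from the earlier localhost query fits within 1232 bytes, so plain A lookups would not need the TCP retry.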

I think this can be closed as no longer an emergency and the mini-initiative in https://pagure.io/fedora-infrastructure/issue/9852 should be looked at to fix it long term.

Metadata Update from @smooge:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
