From time to time we are getting alerts from the Unbound 443/tcp services on various hosts.
Nagios sends:
DNS WARNING - 3.010 seconds response time (dig returned an error status)
This check sets up a stunnel on noc01 to the host, then uses dig to query over it. Somehow dig is returning an error code and causing the warning. A 3 second response time should not be a warning or error.
This may not be a super easy fix, but it should be a good task for a motivated apprentice.
Try to look at it as a member of fi-apprentice. If you have any suggestions, ping me on IRC.
I would like to work on this. Found some information on these two links that might be helpful:
http://nagiosplugins.org/man/check_dns
https://wiki.umbc.edu/display/CIG/Nagios+-+DNS+Checks
Well, we already do have the check in place, it's just that it returns a warning sometimes. :(
This one has been in warning a while: https://admin.fedoraproject.org/nagios/cgi-bin//extinfo.cgi?type=2&host=unbound-telia01&service=Unbound+443%2Ftcp
You can log in to noc01, see the check commands it runs against this host, and copy those scripts to try to debug it. ;)
I think there may be a problem with the check_dig_ssl file located in /usr/lib64/nagios/plugins.
check_dig_ssl is supposed to create the stunnel and then call check_dig to run over it:
{{{
STUNNEL_EXEC=${STUNNEL_EXEC:-/usr/bin/stunnel}
CHECK_DIG_EXEC=${CHECK_DIG_EXEC:-/usr/lib/nagios/plugins/check_dig}
lport=8443

ARGS=""
while getopts "L:H:p:l:T:a:A:w:c:t:v" options
do
    case $options in
        L ) lport=$OPTARG ;;
        H ) host=$OPTARG ;;
        p ) port=$OPTARG ;;
        * ) ARGS="$ARGS -$options $OPTARG";;
    esac
done

TMPFILE=$(mktemp /tmp/$(basename $0)_${host}_${port}_XXXXX)
echo "
client = yes
verify = 0
syslog = no
pid=$TMPFILE.pid
[${host}_${port}]
accept=${lport}
connect=${host}:${port}
" > $TMPFILE

$STUNNEL_EXEC $TMPFILE

$CHECK_DIG_EXEC -H localhost -p ${lport} $ARGS
e_status=$?

kill -9 $(cat $TMPFILE.pid)
rm -f $TMPFILE $TMPFILE.pid
exit $e_status
}}}
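For debugging by hand, the config-writing step of the script can be pulled out into a small function so the generated file can be inspected before stunnel is started. This is only a sketch: `write_stunnel_conf` is a hypothetical name, not part of the plugin, and the options are simply the ones the script already writes.

```shell
# Hypothetical helper (not part of check_dig_ssl): write the same stunnel
# client config the script generates and print the path of the new file.
write_stunnel_conf() {
    local host=$1 port=$2 lport=${3:-8443}
    local conf
    conf=$(mktemp "/tmp/stunnel_${host}_${port}_XXXXX") || return 1
    cat > "$conf" <<EOF
client = yes
verify = 0
syslog = no
pid = ${conf}.pid

[${host}_${port}]
accept = ${lport}
connect = ${host}:${port}
EOF
    echo "$conf"
}
```

For example, `conf=$(write_stunnel_conf unbound-telia01.fedoraproject.org 443) && cat "$conf"` shows exactly what stunnel would be given.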
Could the nameserver be missing from this file? check_dig itself works properly as long as I pass it the nameserver, which I pulled from resolv.conf:
{{{
./check_dig -H 10.5.126.21 -l unbound-telia01.fedoraproject.org
DNS OK - 0.011 seconds response time (unbound-telia01.fedoraproject.org. 300 IN A 80.239.156.220)|time=0.011390s;;;0.000000
}}}
Running check_dig without the nameserver appears to recreate the same error we are seeing within nagios:
{{{
./check_dig -l unbound-telia01.fedoraproject.org
DNS WARNING - 3.014 seconds response time (dig returned an error status)|time=3.013987s;;;0.000000
}}}
I used the check_dig manual article to formulate my command: http://nagiosplugins.org/man/check_dig
Well, it's a bit more complicated. ;)
The reason for the stunnel is that we are querying port 443 on the unbound server. Our unbound servers run a tls/ssl dns on tcp/443.
check_dig is unable to check this directly because it doesn't speak SSL, so we have to pass it through the stunnel. That is why the script passes localhost to check_dig: it points at the local end of the stunnel to port 443/tcp.
If we can come up with a way to check it that doesn't involve the stunnel that would be great too. ;)
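One possible stunnel-free approach, offered only as a sketch: newer dig builds (BIND 9.18 and later) accept a `+tls` option that speaks DNS over TLS directly, so the check could point dig at port 443 itself. Whether the dig on noc01 supports `+tls`, and whether unbound's SSL service accepts it, are assumptions. The function name `check_dns_tls` and the `DIG_EXEC` override (modeled on `STUNNEL_EXEC` in check_dig_ssl) are also made up for this example.

```shell
# Sketch only: query a TLS-wrapped DNS service without stunnel.
# Assumes a dig that understands +tls (BIND 9.18+); older builds will fail.
DIG_EXEC=${DIG_EXEC:-/usr/bin/dig}

check_dns_tls() {
    local server=$1 name=$2 port=${3:-443}
    # dig exits 0 whenever it receives a reply, so this only proves the
    # server answered over TLS, not that the record exists.
    if "$DIG_EXEC" +tls +time=5 +tries=1 -p "$port" "@${server}" "$name" A >/dev/null 2>&1; then
        echo "DNS OK - ${server}:${port} answered a query for ${name}"
        return 0
    fi
    # Report failures as CRITICAL (Nagios exit code 2) rather than WARNING.
    echo "DNS CRITICAL - dig failed against ${server}:${port}"
    return 2
}
```

It would be called as `check_dns_tls unbound-telia01.fedoraproject.org fedoraproject.org`, with the function's return value used as the plugin exit code.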
I went back and stepped through the check_dig_ssl script again and confirmed that the config file for stunnel is being created and that the tunnel appears to be up. I was also able to telnet to localhost port 8443 (the port set in the script by lport). Telnet reported that it connected and then was disconnected. I believe this test is sufficient to confirm that the socket is being created successfully, but I am not certain...
With the stunnel up, the next step is to run check_dig:
{{{
./check_dig -v -H localhost -p 8443 -l unbound-telia01.fedoraproject.org -A "+tcp"
}}}
But unfortunately this just returns the same warning that Nagios is reporting:
{{{
DNS WARNING - 0.383 seconds response time (dig returned an error status)
}}}
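Since check_dig only says that dig returned an error status, it can help to run dig by hand through the tunnel and look at the exit status itself. The mapping below follows the return codes listed in the dig man page; the function name `explain_dig_status` is invented for this example.

```shell
# Translate dig's exit status into a readable message (codes as listed
# in the dig man page). The function name is hypothetical.
explain_dig_status() {
    case $1 in
        0)  echo "ok: a reply was received" ;;
        1)  echo "usage error" ;;
        8)  echo "could not open batch file" ;;
        9)  echo "no reply from server" ;;
        10) echo "internal error" ;;
        *)  echo "unknown status $1" ;;
    esac
}

# Example, with the stunnel listening on local port 8443:
#   dig +tcp -p 8443 @localhost unbound-telia01.fedoraproject.org >/dev/null
#   explain_dig_status $?
```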
STUNNEL
{{{
[neldogz@noc01 plugins]$ ps -ef | grep stunnel
neldogz 3513 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3514 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3515 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3516 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3517 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3518 1 0 03:05 ? 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 6370 554 0 03:12 pts/0 00:00:00 grep stunnel

STUNNEL Config File
[neldogz@noc01 plugins]$ cat /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS

client = yes
verify = 0
syslog = no
pid=/tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS.pid
[unbound-telia01.fedoraproject.org_443]
accept=8443
connect=unbound-telia01.fedoraproject.org:443
}}}
At this point I am not sure why check_dig is not resolving a request over the tunnel. Some things that I can think of checking but I don't have access to are the following:
1) Is there anything in /var/log/messages that could be useful?
2) Is any packet filtering being performed between noc01 and unbound-telia01 that could be preventing this from succeeding?
3) Is the ssl-dns service on unbound-telia01 actually answering these kinds of DNS requests?
4) Does anyone know of a way to test end-to-end stunnel connectivity between these servers? Maybe somehow confirm that the connection comes up on unbound-telia01 when Nagios kicks off the scheduled check? I am not sure if this is how a stunnel is supposed to behave.
Additionally, I posted on the Nagios forums: http://support.nagios.com/forum/viewtopic.php?f=7&t=5593
Thanks a lot for looking into this!
I don't think so, but can look.
There should be no packet filtering between them on port 443. That's going over our VPN.
I'll have to ask the unbound maintainer about this. I'll note we have 2 more unbound servers that don't seem to show this warning, they seem to show ok all the time. ;(
It seems to work for the other 2 instances we are monitoring.
I can see if I can get the unbound maintainer to look at this some... adding them to the ticket...
Any ideas?
Hi, sorry for the late response. As I'm the one who wrote this bad script (check_dig_ssl), I thought I ought to have a look at this.
First, check_dig_ssl just sets up an SSL tunnel and runs check_dig through it. Here we have 2 issues:
- The check_dig script reports a warning instead of the real error.
- unbound at telia01 seems not to work as it should.
Regarding the first issue, check_dig seems to hide that unbound at telia immediately closes the TCP connection after connection is established:
{{{
[ctria@noc01 ~]$ telnet unbound-telia01.fedoraproject.org 443
Trying 80.239.156.220...
Connected to unbound-telia01.fedoraproject.org.
Escape character is '^]'.
Connection closed by foreign host.
}}}
I'm not sure if I can fix this wrong error reporting on the check_dig_ssl side, probably not.
Regarding the unbound issue at telia01, I guess something useful should be found in the logs on that host (sorry, no access there).
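One possible way for a wrapper to surface this case, sketched under assumptions: probe the TCP port before starting the tunnel and report CRITICAL if the peer drops the connection straight away. `probe_tcp` and its return codes are invented for this example. It relies on bash's /dev/tcp redirection (bash 4 or later for the read timeout status) and on the coreutils `timeout` command.

```shell
# Sketch only. Returns 0 if the port accepts a connection and keeps it open,
# 3 if the peer closes it right after connecting, and another non-zero value
# if the connect itself fails or takes too long.
probe_tcp() {
    local host=$1 port=$2 wait=${3:-2}
    timeout "$((wait + 5))" bash -c '
        {
            # Wait up to $3 seconds for one byte or EOF from the peer.
            IFS= read -r -t "$3" -n 1 byte <&3
            rc=$?
            # 0 = data arrived, >128 = read timed out: still connected.
            if [ "$rc" -eq 0 ] || [ "$rc" -gt 128 ]; then exit 0; fi
            exit 3   # EOF: the peer closed the connection
        } 3<>"/dev/tcp/$1/$2"
    ' _ "$host" "$port" "$wait" 2>/dev/null
}

# Possible use in a wrapper, before stunnel is started:
#   if ! probe_tcp "$host" "$port"; then
#       echo "DNS CRITICAL - ${host}:${port} refused or dropped the connection"
#       exit 2
#   fi
```

A TLS service normally waits for the client to speak first, so a healthy port should hit the read timeout and return 0, while the behaviour seen in the telnet session would return 3.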
Regards,
Christos
Something was odd with the unbound process on unbound-telia01, so I restarted the service. It no longer hangs when connected to, but still returns the warning. ;(
The problem seems to have been resolved sometime after the unbound process on unbound-telia01 was restarted.
The problem is no longer being reported by Nagios: https://admin.fedoraproject.org/nagios/cgi-bin//extinfo.cgi?type=2&host=unbound-telia01&service=Unbound+443%2Ftcp
Confirmed this with nirik.