#3231 Figure out why unbound ssl monitoring breaks from time to time
Closed: Fixed None Opened 12 years ago by kevin.

From time to time we are getting alerts from the Unbound 443/tcp services on various hosts.

Nagios sends:

DNS WARNING - 3.010 seconds response time (dig returned an error status)

This check sets up a stunnel on noc01 to the host, then uses dig to query over it.
Somehow dig is returning an error code and causing the warning.
A 3 second response time should not be a warning or error.


This may not be a super easy fix, but it should be a good task for a motivated apprentice.

Try to look at it as a member of fi-apprentice. If you have some suggestions, try to ping me on irc.

I would like to work on this. Found some information on these two links that might be helpful:

http://nagiosplugins.org/man/check_dns

https://wiki.umbc.edu/display/CIG/Nagios+-+DNS+Checks

Well, we already do have the check in place, it's just that it returns a warning sometimes. :(

This one has been in warning a while:
https://admin.fedoraproject.org/nagios/cgi-bin//extinfo.cgi?type=2&host=unbound-telia01&service=Unbound+443%2Ftcp

You can log in to noc01, see the check commands it runs against this host, and copy those scripts to try and debug it. ;)

I think there may be a problem with the check_dig_ssl file located in /usr/lib64/nagios/plugins.

check_dig_ssl is supposed to create the stunnel and then call check_dig to run over it:

#!/bin/bash
#
# 29-02-2012
#
# Author: Christos Triantafyllidis christos.triantafyllidis@gmail.com
#

# Default values
STUNNEL_EXEC=${STUNNEL_EXEC:-/usr/bin/stunnel}
CHECK_DIG_EXEC=${CHECK_DIG_EXEC:-/usr/lib/nagios/plugins/check_dig}
lport=8443

ARGS=""
while getopts "L:H:p:l:T:a:A:w:c:t:v" options
do
    case $options in
        L ) lport=$OPTARG ;;
        H ) host=$OPTARG ;;
        p ) port=$OPTARG ;;
        * ) ARGS="$ARGS -$options $OPTARG";;
    esac
done

# Create a ssl tunnel to the request socket
TMPFILE=$(mktemp /tmp/$(basename $0)_${host}_${port}_XXXXX)
echo "
client = yes
verify = 0
syslog = no
pid=$TMPFILE.pid
[${host}_${port}]
accept=${lport}
connect=${host}:${port}
" > $TMPFILE

$STUNNEL_EXEC $TMPFILE

# Use check_dig via the stunnel
$CHECK_DIG_EXEC -H localhost -p ${lport} $ARGS
e_status=$?

# cleanup
kill -9 $(cat $TMPFILE.pid)
rm -f $TMPFILE $TMPFILE.pid
exit $e_status
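
The script above calls check_dig right after launching stunnel and only cleans up if it reaches its last lines. Nothing in this ticket establishes that either point causes the intermittent warnings, so the following is only a sketch of how the wrapper could guard against them. `wait_until`, `port_is_open`, and `WAIT_INTERVAL` are hypothetical names, and the `/dev/tcp` probe relies on bash.

```shell
#!/bin/bash
# Sketch only: these helpers are hypothetical and not part of check_dig_ssl.

# Run a command repeatedly until it succeeds or max_tries attempts are used up.
wait_until() {
    local max_tries=$1 try=0
    shift
    while [ "$try" -lt "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        try=$((try + 1))
        sleep "${WAIT_INTERVAL:-1}"
    done
    return 1
}

# Succeeds if a TCP connection to host:port can be opened (bash /dev/tcp feature).
port_is_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Possible use inside the wrapper, right after "$STUNNEL_EXEC $TMPFILE":
#   trap 'kill "$(cat "$TMPFILE.pid")" 2>/dev/null; rm -f "$TMPFILE" "$TMPFILE.pid"' EXIT
#   wait_until 5 port_is_open localhost "$lport" || {
#       echo "DNS CRITICAL - stunnel is not listening on port $lport"
#       exit 2
#   }
```

The `trap ... EXIT` line would remove the temporary config and stop stunnel even when the script exits early, which the original cleanup section does not cover.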

Could the nameserver be missing from check_dig_ssl? check_dig itself works properly as long as I pass it the necessary nameserver information, which I pulled from resolv.conf:

./check_dig -H 10.5.126.21 -l unbound-telia01.fedoraproject.org
DNS OK - 0.011 seconds response time (unbound-telia01.fedoraproject.org. 300 IN A 80.239.156.220)|time=0.011390s;;;0.000000

Running check_dig without the nameserver appears to recreate the same error we are seeing within nagios:

./check_dig -l unbound-telia01.fedoraproject.org
DNS WARNING - 3.014 seconds response time (dig returned an error status)|time=3.013987s;;;0.000000

I used the check_dig manual article to formulate my command:
http://nagiosplugins.org/man/check_dig

Well, it's a bit more complicated. ;)

The reason for the stunnel is that we are querying port 443 on the unbound server.
Our unbound servers run a tls/ssl dns on tcp/443.

check_dig is unable to directly check this because it doesn't talk ssl, so we have to pass it through the stunnel, so you can see the script passes in localhost there (for the stunnel to port 443/tcp).

If we can come up with a way to check it that doesn't involve the stunnel that would be great too. ;)
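
One stunnel-free option that did not exist when this ticket was opened: dig from BIND 9.18 and later can speak DNS over TLS itself through the `+tls` option. The sketch below assumes such a dig is installed on the monitoring host and that the tcp/443 service is plain DNS wrapped in TLS. The function name `check_dns_tls` and the output wording are hypothetical.

```shell
#!/bin/bash
# Sketch only: assumes a dig that supports +tls (BIND 9.18 or later).
DIG_EXEC=${DIG_EXEC:-dig}

# Query an A record over TLS and print a Nagios-style status line.
# Returns 0 (OK), 1 (WARNING) or 2 (CRITICAL).
check_dns_tls() {
    local host=$1 port=$2 name=$3 answer status
    answer=$("$DIG_EXEC" +tls +short +time=5 +tries=1 -p "$port" "@$host" "$name" A 2>&1)
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "DNS CRITICAL - dig exited with status $status: $answer"
        return 2
    fi
    # dig can exit 0 even when the answer section is empty.
    if [ -z "$answer" ]; then
        echo "DNS WARNING - empty answer for $name"
        return 1
    fi
    echo "DNS OK - $name resolves to $answer"
    return 0
}

# Example:
#   check_dns_tls unbound-telia01.fedoraproject.org 443 unbound-telia01.fedoraproject.org
```

Because the failure branch reports dig's own exit status and output, a closed connection would show up as CRITICAL with a reason rather than as a generic warning.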

I went back and stepped through the check_dig_ssl script again and confirmed that the stunnel config file is being created and the tunnel appears to be up. I was also able to telnet to localhost port 8443 (the port set in the script by lport). Telnet reported that it connected and was then disconnected. I believe this test is sufficient to confirm that the socket is being created successfully, but I am not certain.

With the stunnel up, the next step is to run check_dig: ./check_dig -v -H localhost -p 8443 -l unbound-telia01.fedoraproject.org -A "+tcp"

But unfortunately this just returns the same warning that nagios is reporting: DNS WARNING - 0.383 seconds response time (dig returned an error status)
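
The plugin message says only that dig returned an error status, not which one. Running dig by hand through the tunnel and looking at its exit status narrows this down. The sketch below maps the exit codes listed in the dig man page to text; the function name `describe_dig_status` is hypothetical.

```shell
#!/bin/bash
# Sketch only: translate dig's documented exit codes into readable text.
describe_dig_status() {
    case $1 in
        0)  echo "everything went well" ;;
        1)  echo "usage error" ;;
        8)  echo "could not open batch file" ;;
        9)  echo "no reply from server" ;;
        10) echo "internal error" ;;
        *)  echo "undocumented status $1" ;;
    esac
}

# Example, with the stunnel from check_dig_ssl still listening on port 8443:
#   dig +tcp -p 8443 @localhost unbound-telia01.fedoraproject.org
#   describe_dig_status $?
```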

STUNNEL
{{{
[neldogz@noc01 plugins]$ ps -ef | grep stunnel
neldogz 3513 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3514 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3515 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3516 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3517 1 0 03:05 pts/0 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 3518 1 0 03:05 ? 00:00:00 /usr/bin/stunnel /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS
neldogz 6370 554 0 03:12 pts/0 00:00:00 grep stunnel

STUNNEL Config File
[neldogz@noc01 plugins]$ cat /tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS

client = yes
verify = 0
syslog = no
pid=/tmp/-bash_unbound-telia01.fedoraproject.org_443_0A0IS.pid
[unbound-telia01.fedoraproject.org_443]
accept=8443
connect=unbound-telia01.fedoraproject.org:443

}}}

At this point I am not sure why check_dig is not resolving a request over the tunnel. Some things I can think of checking, but don't have access to, are the following:

1) Is there anything in /var/log/messages that could be useful?
2) Confirm whether any packet filtering between noc01 and unbound-telia01 could be preventing this from succeeding.
3) Confirm that the ssl-dns service is indeed servicing those types of DNS requests on unbound-telia01.
4) Does anyone know of a way to test end-to-end stunnel connectivity between these servers? Maybe somehow confirm that the stunnel comes up on unbound-telia01 when nagios kicks off the scheduled check? I am not sure if this is how a stunnel is supposed to behave.
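
On points 3 and 4, one way to test the TLS side end to end without stunnel is `openssl s_client`, which attempts the handshake against the remote port directly. This is a sketch under the assumption that s_client exits non-zero when it cannot connect or complete the handshake; the function name `tls_probe` and its messages are hypothetical. It has no timeout of its own, so a server that accepts the connection and then stalls would hang it.

```shell
#!/bin/bash
# Sketch only: probe whether a TLS handshake with host:port completes.
OPENSSL_EXEC=${OPENSSL_EXEC:-openssl}

tls_probe() {
    local host=$1 port=$2
    # stdin from /dev/null makes s_client quit once the handshake is done.
    if "$OPENSSL_EXEC" s_client -connect "$host:$port" </dev/null >/dev/null 2>&1; then
        echo "TLS OK - handshake with $host:$port completed"
        return 0
    fi
    echo "TLS CRITICAL - could not complete a handshake with $host:$port"
    return 2
}

# Example:
#   tls_probe unbound-telia01.fedoraproject.org 443
```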

Additionally, I threw a post on the nagios forums: http://support.nagios.com/forum/viewtopic.php?f=7&t=5593

Thanks a lot for looking into this!

  1. I don't think so, but can look.

  2. There should be no packet filtering between them on port 443. That's going over our vpn.

  3. I'll have to ask the unbound maintainer about this. I'll note we have 2 more unbound servers that don't seem to show this warning, they seem to show ok all the time. ;(

  4. It seems to work for the other 2 instances we are monitoring.

I can see if I can get the unbound maintainer to look at this some... adding them to the ticket...

Any ideas?

Hi,
sorry for the late response. As I'm the one who wrote this bad script (check_dig_ssl), I thought I ought to have a look at this.

First, check_dig_ssl just sets up an SSL tunnel and runs check_dig through it. And here we have 2 issues:
- The check_dig script reports a warning instead of the real error
- unbound at telia01 does not seem to work as it should.

Regarding the first issue, check_dig seems to hide that unbound at telia immediately closes the TCP connection after the connection is established:

{{{
[ctria@noc01 ~]$ telnet unbound-telia01.fedoraproject.org 443
Trying 80.239.156.220...
Connected to unbound-telia01.fedoraproject.org.
Escape character is '^]'.
Connection closed by foreign host.
}}}

I'm not sure if I can fix this wrong error reporting on the check_dig_ssl side; probably not.
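
One possible workaround on the wrapper side, offered only as a sketch: capture check_dig's output and raise the result to CRITICAL whenever it contains the "dig returned an error status" text, since that text means dig itself failed rather than being slow. The function name `escalate_dig_error` is hypothetical, and the approach assumes the message wording shown in this ticket.

```shell
#!/bin/bash
# Sketch only: turn check_dig's generic WARNING into CRITICAL when dig failed.
# $1 = exit status from check_dig, $2 = the line check_dig printed.
escalate_dig_error() {
    local status=$1 output=$2
    case $output in
        *"dig returned an error status"*)
            echo "DNS CRITICAL - ${output#DNS WARNING - }"
            return 2
            ;;
    esac
    # Anything else is passed through unchanged.
    echo "$output"
    return "$status"
}

# Possible use in check_dig_ssl instead of calling check_dig directly:
#   output=$($CHECK_DIG_EXEC -H localhost -p ${lport} $ARGS)
#   escalate_dig_error $? "$output"
#   e_status=$?
```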

Regarding the unbound issue at telia01, I guess something useful should be found in the logs on that host (sorry, no access there).

Regards,

Christos

Something was odd with the unbound process on unbound-telia01, so I restarted the service. It no longer hangs when connected to, but still returns the warning. ;(

The problem seems to have been resolved some time after the unbound process on unbound-telia01 was restarted.

The problem is no longer being reported by Nagios: https://admin.fedoraproject.org/nagios/cgi-bin//extinfo.cgi?type=2&host=unbound-telia01&service=Unbound+443%2Ftcp

Confirmed this with nirik.
