Here's a funny situation. The `dc/greenwave-web` containers are able to access the `dc/greenwave-memcached` containers in stg, but for some reason they are unable to do so in prod.
```
[PROD]# oc rsh dc/greenwave-web
sh-4.4$ nc greenwave-memcached 11211
Ncat: Could not resolve hostname "greenwave-memcached": Name or service not known. QUITTING.
sh-4.4$ nslookup greenwave-memcached
Server:    10.5.126.248
Address:   10.5.126.248#53

Name:    greenwave-memcached.greenwave.svc.cluster.local
Address: 172.30.7.195
```
/cc @codeblock @mjia @dcallagh
Blocking #6363
One observation: the contents of `/etc/resolv.conf` are non-trivially different between the stage and prod containers. Here's prod:
```
[os-master01 ~][PROD]# oc rsh dc/greenwave-web
sh-4.4$ cat /etc/resolv.conf
search greenwave.svc.cluster.local svc.cluster.local cluster.local phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
nameserver 10.5.126.164
nameserver 10.5.126.21
nameserver 10.5.126.22
options rotate timeout:1
options ndots:5
```
and staging:
```
[os-master01 ~][STG]# oc rsh dc/greenwave-web
sh-4.4$ cat /etc/resolv.conf
nameserver 10.5.128.104
search greenwave.svc.cluster.local svc.cluster.local cluster.local phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
options ndots:5
```
Looks like our prod DNS additions might be causing this?
If you remove:

```
nameserver 10.5.126.21
nameserver 10.5.126.22
options rotate timeout:1
```
Does it then work?
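For reference, here's a sketch of what the trimmed prod `/etc/resolv.conf` would look like, assuming only those three lines are dropped and nothing else changes:

```
search greenwave.svc.cluster.local svc.cluster.local cluster.local phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
nameserver 10.5.126.164
options ndots:5
```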
I'm not sure how to do that in a container! :bento:
Will try `oc debug`:
```
[os-master01 ~][PROD]# oc debug dc/greenwave-web
Debugging with pod/greenwave-web-debug, original command: /bin/sh -c gunicorn --bind 0.0.0.0:8080 --access-logfile=- greenwave.wsgi:app
Waiting for pod to start ...
Pod IP: 10.128.2.92
If you don't see a command prompt, try pressing enter.
sh-4.4# nc greenwave-memcached 11211
Ncat: Could not resolve hostname "greenwave-memcached": Name or service not known. QUITTING.
sh-4.4# echo time to remove stuff
time to remove stuff
sh-4.4# vi /etc/resolv.conf
sh-4.4# nc greenwave-memcached 11211
blah blah blah
ERROR
good.
ERROR
```
Looks like that works. I wonder what writes out `/etc/resolv.conf` (how do we change this in ansible and/or openshift?)
These are in the base role setup on the nodes... I think it copies that from the node.
That said, stg also has that in resolv.conf on the nodes... so I wonder if that wasn't changed in 3.6 (which we have in stg only).
Hm. Any plans to update prod to 3.6?
Or should we pursue modifying /etc/resolv.conf on the prod nodes directly (via the base role)?
Updating `/etc/resolv.conf` directly would break the system, and we should not do that. However, we can make a separate resolv.conf for just the pods: https://access.redhat.com/solutions/2215521
OK, I can't seem to access that. If you give me pointers or a transcription, I can take a stab at setting it up?
I will send a summary to you in a PM.
Thanks for the info @puiterwijk. The instructions there make sense - but I haven't found my way around how our cluster is configured yet (to see where I might tweak `/etc/origin/node/node-config.yaml`). I'm probably in over my head, here.
Okay, I'll take a look at this tonight then. Our cluster is indeed set up somewhat... interestingly.
From looking at the upstream `openshift/openshift-ansible` repo, it looks like the `dnsRecursiveResolvConf` setting in `node-config.yaml` was only introduced in OpenShift 3.6 (which we have in stg but not in prod).
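For illustration, a minimal sketch of the relevant 3.6-era `node-config.yaml` stanza. The option name is from the upstream repo; the file paths and bind address here are assumptions on my part, not copied from our cluster:

```yaml
# Hypothetical excerpt of /etc/origin/node/node-config.yaml (OpenShift 3.6).
# The node answers pod DNS queries locally...
dnsBindAddress: 127.0.0.1:53
# ...and uses a separate resolv.conf copy for recursive (external) lookups:
dnsRecursiveResolvConf: /etc/origin/node/resolv.conf
```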
Is our best option to tear down and rebuild the whole prod cluster? Nothing will depend on greenwave or waiverdb until this unblocks #6363, so it would be OK to do with downtime.
Well, we upgraded our staging one and @codeblock said that went pretty easily, so we could just try and do that to prod next week?
@codeblock - is it a matter of destroying the prod nodes, adjusting the version in the playbook, and just re-running it from scratch?
@ralph When I did the stg upgrade, I didn't need to destroy anything. I just ran the upgrade openshift-ansible playbooks from os-control and it basically did everything automatically. Then I rebooted everything and we were good. The whole process took maybe an hour.
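For the record, a sketch of the sort of invocation that was involved. The playbook path is from the upstream openshift-ansible layout of that era and the inventory path is an assumption; neither may match our checkout exactly:

```
# Hypothetical upgrade run from os-control:
ansible-playbook -i /etc/ansible/hosts \
    openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml
```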
:sparkles: awesome :sparkles:
Note a wrinkle here: we recently got some SSDs to replace spinning drives in two machines.
One of those machines is virthost06, which hosts our production openshift.
So, we can upgrade as needed, but I'd like to take a bit of downtime after freeze and replace all the drives and re-install. Will that be ok? Or should we look at migrating the openshift nodes off it, and then back on?
Nope, no need to migrate; downtime is still OK. We won't have anything depending on this until Bodhi is switched to use it (a config option). Also, @mohanboddu is working on a rawhide gating tool that will consult greenwave too. So long as neither of those is in place, no one will notice the downtime.
I applied the fix from the access.redhat.com article, and the problem is fixed. Will move it to ansible now.
Metadata Update from @puiterwijk: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Nice!
Metadata Update from @ralph: - Issue status updated to: Open (was: Closed)
Okay, so I've found why OpenShift 3.6 does not have this problem, and why the "fix" caused problems resolving anything outside of openshift. In OpenShift 3.6, the cluster DNS servers do recursive lookups themselves. With 3.5, the pods are expected to do external lookups directly, and the external DNS server is supposed to return NXDOMAIN for the cluster.local domains; that NXDOMAIN is what makes the container fall back to the in-cluster DNS server. Currently, our nameservers return REFUSED instead, which does not make libresolv fall over to the internal DNS server. We could fix our nameservers to correctly return NXDOMAIN for cluster.local lookups, but I think it's simpler to just upgrade to OpenShift 3.6.
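If we did want to fix the nameservers instead, a minimal sketch of the NXDOMAIN behavior, assuming dnsmasq is in the external resolver path (the directive is standard dnsmasq; where it would live in our setup is hypothetical):

```
# Treat cluster.local as local-only: answer from local data if present,
# otherwise return NXDOMAIN instead of REFUSED, so libresolv falls
# through to the next nameserver / in-cluster resolver.
local=/cluster.local/
```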
(For the record: I fixed the fact that pods could not resolve external names by making the node's dnsmasq allow this. So, for the moment, our openshift prod instance works completely.)
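Roughly, the node dnsmasq now does split DNS along these lines. This is only a sketch: the 127.0.0.1 address for the in-node cluster resolver is an assumption, and the upstream IP is taken from the prod resolv.conf shown earlier, not from the node itself:

```
# Send cluster lookups to the in-node OpenShift/SkyDNS resolver...
server=/cluster.local/127.0.0.1
server=/in-addr.arpa/127.0.0.1
# ...and let everything else go to the normal upstream resolver.
server=10.5.126.164
```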
Thanks!
I verified with `nc` in `oc debug`.
And, I've applied this patch which tells greenwave to use memcached in prod again: https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=d5d94eb93e06d233d34c2f6dd4c852fafa8bb220
prod is 3.6 now too.
:open_hands:
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)