Here's a funny situation. The `dc/greenwave-web` containers are able to access the `dc/greenwave-memcached` containers in stg, but for some reason they are unable to do so in prod.
```
[PROD]# oc rsh dc/greenwave-web
sh-4.4$ nc greenwave-memcached 11211
Ncat: Could not resolve hostname "greenwave-memcached": Name or service not known. QUITTING.
sh-4.4$ nslookup greenwave-memcached
Server:    10.5.126.248
Address:   10.5.126.248#53

Name:    greenwave-memcached.greenwave.svc.cluster.local
Address: 172.30.7.195
```
/cc @codeblock @mjia @dcallagh
Blocking #6363
One observation: the contents of `/etc/resolv.conf` are non-trivially different between the stage and prod containers. Here's prod:
```
[os-master01 ~][PROD]# oc rsh dc/greenwave-web
sh-4.4$ cat /etc/resolv.conf
search greenwave.svc.cluster.local svc.cluster.local cluster.local phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
nameserver 10.5.126.164
nameserver 10.5.126.21
nameserver 10.5.126.22
options rotate timeout:1
options ndots:5
```
and staging:
```
[os-master01 ~][STG]# oc rsh dc/greenwave-web
sh-4.4$ cat /etc/resolv.conf
nameserver 10.5.128.104
search greenwave.svc.cluster.local svc.cluster.local cluster.local phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
options ndots:5
```
Looks like our prod DNS additions might be causing this?
If you remove:

```
nameserver 10.5.126.21
nameserver 10.5.126.22
options rotate timeout:1
```
Does it then work?
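For reference, here's a sketch of what the trimmed prod `/etc/resolv.conf` would look like, assuming only those three lines are dropped and nothing else changes:

```
search greenwave.svc.cluster.local svc.cluster.local cluster.local phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
nameserver 10.5.126.164
options ndots:5
```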
I'm not sure how to do that in a container! :bento:
Will try `oc debug`:
```
[os-master01 ~][PROD]# oc debug dc/greenwave-web
Debugging with pod/greenwave-web-debug, original command: /bin/sh -c gunicorn --bind 0.0.0.0:8080 --access-logfile=- greenwave.wsgi:app
Waiting for pod to start ...
Pod IP: 10.128.2.92
If you don't see a command prompt, try pressing enter.
sh-4.4# nc greenwave-memcached 11211
Ncat: Could not resolve hostname "greenwave-memcached": Name or service not known. QUITTING.
sh-4.4# echo time to remove stuff
time to remove stuff
sh-4.4# vi /etc/resolv.conf
sh-4.4# nc greenwave-memcached 11211
blah blah blah
ERROR
good.
ERROR
```
Looks like that works. I wonder what writes out `/etc/resolv.conf` (how do we change this in ansible and/or openshift?)
These are in the base role setup on the nodes... I think it copies that from the node.
That said, stg also has that in resolv.conf on the nodes... so I wonder if that wasn't changed in 3.6 (which we have in stg only).
Hm. Any plans to update prod to 3.6?
Or should we pursue modifying /etc/resolv.conf on the prod nodes directly (via the base role)?
Updating `/etc/resolv.conf` directly would break the system, and we should not do that. However, we can make a separate resolv.conf for just the pods: https://access.redhat.com/solutions/2215521
OK, I can't seem to access that. If you give me pointers or a transcription, I can take a stab at setting it up?
I will send a summary to you in a PM.
Thanks for the info @puiterwijk. The instructions there make sense - but I haven't found my way around how our cluster is configured yet (to see where I might tweak `/etc/origin/node/node-config.yaml`). I'm probably in over my head, here.
Okay, I'll take a look at this tonight then. Our cluster is indeed set up somewhat... interestingly.
From looking at the upstream `openshift/openshift-ansible` repo, it looks like the `dnsRecursiveResolvConf` setting in `node-config.yaml` was only introduced in OpenShift 3.6 (which we have in stg but not in prod).
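For illustration, a minimal sketch of the relevant 3.6-era `node-config.yaml` stanza. The option name is from the upstream repo; the file paths and bind address here are assumptions on my part, not copied from our cluster:

```yaml
# Hypothetical excerpt of /etc/origin/node/node-config.yaml (OpenShift 3.6).
# The node answers pod DNS queries locally...
dnsBindAddress: 127.0.0.1:53
# ...and uses a separate resolv.conf copy for recursive (external) lookups:
dnsRecursiveResolvConf: /etc/origin/node/resolv.conf
```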
Is our best option to tear down and rebuild the whole prod cluster? Nothing will depend on greenwave or waiverdb until this unblocks #6363, so it would be OK to do with downtime.
Well, we upgraded our staging one and @codeblock said that went pretty easily, so we could just try and do that to prod next week?
@codeblock - is it a matter of destroying the prod nodes, adjusting the version in the playbook, and just re-running it from scratch?
@ralph When I did the stg upgrade, I didn't need to destroy anything. I just ran the upgrade openshift-ansible playbooks from os-control and it basically did everything automatically. Then I rebooted everything and we were good. The whole process took maybe an hour.
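For the record, a sketch of the sort of invocation that was involved. The playbook path is from the upstream openshift-ansible layout of that era and the inventory path is an assumption; neither may match our checkout exactly:

```
# Hypothetical upgrade run from os-control:
ansible-playbook -i /etc/ansible/hosts \
    openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml
```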
:sparkles: awesome :sparkles:
Note a wrinkle here: we recently got some SSDs to replace spinning drives in two machines.
One of those machines is virthost06, which hosts our production openshift.
So, we can upgrade as needed, but I'd like to take a bit of downtime after freeze and replace all the drives and re-install. Will that be ok? Or should we look at migrating the openshift nodes off it, and then back on?
Nope, no need to migrate; downtime is still OK. We won't have anything depending on this until Bodhi is switched to use it (a config option). Also, @mohanboddu is working on a rawhide gating tool that will consult greenwave too. So long as neither of those is in place, no one will notice the downtime.
I applied the fix from the access.redhat.com article, and the problem is fixed. Will move it to ansible now.
Metadata Update from @puiterwijk: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Nice!
Metadata Update from @ralph: - Issue status updated to: Open (was: Closed)
Okay, so I've found why OpenShift 3.6 does not have this problem, and why the "fix" caused problems resolving anything outside of openshift. In OpenShift 3.6, the cluster DNS servers do recursive lookups themselves. With 3.5, the pods are expected to do external lookups directly, and the external DNS server is supposed to return NXDOMAIN for the cluster.local domains; that NXDOMAIN is what makes the container fall back to the in-cluster DNS server. Currently, our nameservers return REFUSED instead, which does not make libresolv fall over to the internal DNS server. We could fix our nameservers to correctly return NXDOMAIN for cluster.local lookups, but I think it's simpler to just upgrade to OpenShift 3.6.
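If we did want to fix the nameservers instead, a minimal sketch of the NXDOMAIN behavior, assuming dnsmasq is in the external resolver path (the directive is standard dnsmasq; where it would live in our setup is hypothetical):

```
# Treat cluster.local as local-only: answer from local data if present,
# otherwise return NXDOMAIN instead of REFUSED, so libresolv falls
# through to the next nameserver / in-cluster resolver.
local=/cluster.local/
```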
(For the record: I fixed the fact that pods could not resolve external names by making the node's dnsmasq allow this. So, for the moment, our openshift prod instance works completely.)
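Roughly, the node dnsmasq now does split DNS along these lines. This is only a sketch: the 127.0.0.1 address for the in-node cluster resolver is an assumption, and the upstream IP is taken from the prod resolv.conf shown earlier, not from the node itself:

```
# Send cluster lookups to the in-node OpenShift/SkyDNS resolver...
server=/cluster.local/127.0.0.1
server=/in-addr.arpa/127.0.0.1
# ...and let everything else go to the normal upstream resolver.
server=10.5.126.164
```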
Thanks!
I verified with `nc` in `oc debug`.
And, I've applied this patch which tells greenwave to use memcached in prod again: https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=d5d94eb93e06d233d34c2f6dd4c852fafa8bb220
prod is 3.6 now too.
:open_hands:
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)