#231 SSL proxies for taskotron-prod are broken due to incorrect DNS setting
Closed: Fixed 6 years ago Opened 6 years ago by kparal.

Yesterday taskotron-prod (and resultsdb-prod) became inaccessible: the proxies, which handle both HTTP and HTTPS for these hosts, could not reach them because their VPN link was down. Puiterwijk fixed it for us and told us the cause: our NetworkManager.conf does not exclude DNS, so NetworkManager rewrites resolv.conf and breaks it. It has been temporarily fixed by adding the search domain and both DNS servers to /etc/sysconfig/network-scripts/ifcfg-eth0, so that NetworkManager generates a correct resolv.conf, but it needs to be fixed properly.

The correct NetworkManager.conf should have been generated by the Ansible base role, here:

roles/base/tasks/main.yml:

- name: disable resolv.conf control from NM
  ini_file: dest=/etc/NetworkManager/NetworkManager.conf section=main option=dns value=none
  notify:
  - restart NetworkManager
  when: ansible_distribution_major_version|int >=7 and nmclitest|success and ( not ansible_ifcfg_blacklist)
  tags:
   - config
   - resolvconf
   - base
   - ifcfg

However, that task is skipped because we have ansible_ifcfg_blacklist set:

$ git grep ansible_ifcfg_blacklist | grep -E '(taskotron|resultsdb)'
inventory/host_vars/resultsdb-stg01.qa.fedoraproject.org:ansible_ifcfg_blacklist: true
inventory/host_vars/resultsdb01.qa.fedoraproject.org:ansible_ifcfg_blacklist: true
inventory/host_vars/taskotron-stg01.qa.fedoraproject.org:ansible_ifcfg_blacklist: true
inventory/host_vars/taskotron01.qa.fedoraproject.org:ansible_ifcfg_blacklist: true

We need to figure out why we have it set, change it if appropriate, or make sure the conf gets updated in our own playbook.
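
If we go with the last option, a minimal sketch of such a task might look like the one below. It copies the base role task and only drops the ansible_ifcfg_blacklist condition. Assumptions: the task would live in the taskotron/resultsdb playbooks (the exact file is not decided here), the task name is made up, and the "restart NetworkManager" handler is assumed to be available in that play.

```yaml
# Hypothetical task for our own playbooks; not taken from the ansible repo.
# Same ini_file call as roles/base/tasks/main.yml, but without the
# "not ansible_ifcfg_blacklist" condition, so it also runs on our hosts.
- name: disable resolv.conf control from NM (taskotron/resultsdb hosts)
  ini_file: dest=/etc/NetworkManager/NetworkManager.conf section=main option=dns value=none
  notify:
  # assumes this handler is defined for the play, as it is for the base role
  - restart NetworkManager
  tags:
   - config
   - resolvconf
```

This should leave dns=none in the [main] section of NetworkManager.conf, which is what the base role would have written had it not been skipped. It does not touch whatever ansible_ifcfg_blacklist was originally set to work around.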

Full conversation follows:

<puiterwijk> Okay, taskotron down it seems. Looking
<kparal> it's up, just ssl seems not working
<puiterwijk> kparal: that means it's not up.
<puiterwijk> Since http is also handled by proxies
<kparal> interesting, just a while ago http returned with 302 to https
<puiterwijk> yes
<puiterwijk> That's done by the proxies
<puiterwijk> There's no communication to taskotron01
<puiterwijk> Its VPN link is offline
<puiterwijk> Ah. The resolv.conf is broken
<puiterwijk> kparal: running base role
<kparal> is that something we caused somehow?
<puiterwijk> Well, somehow the NetworkManager config got lost
<puiterwijk> We configure NM to not mess up the resolv.conf
<kparal> I'm not aware of any action that would cause that
<kparal> there was a dnf upgrade yesterday, but NM was not involved
<puiterwijk> Okay, the network config file is incorrect
<puiterwijk> Aug 10 09:51:31 taskotron01.qa.fedoraproject.org openvpn[2115]: Initialization Sequence Completed
<puiterwijk> and it's working again
<puiterwijk> https://taskotron.fedoraproject.org/
<kparal> puiterwijk: so all we needed to do was to run the full playbook (which would include the base role) and it would get fixed, right?
<kparal> just making sure I know what to do next time
<puiterwijk> kparal: well, not really, since normally NM should not do dns in our instances, but for some reason taskotron is setup to do so
<puiterwijk> I can look at that in a few, first looking at something else
<kparal> jskladan: ^^ taskotron landing page and buildmaster is up now. resultsdb is not, though. is that something you're going to fix on your side?
<puiterwijk> kparal: likely resultsdb server has the same issue
<puiterwijk> yep
<puiterwijk> kparal: coming back
<kparal> thanks a lot puiterwijk 
<kparal> puiterwijk++
<puiterwijk> same story: please check your NetworkManager.conf why it's not excluding dns
<kparal> my cookies are depleted
<puiterwijk> :)
<kparal> puiterwijk: the reason seems because we have ansible_ifcfg_blacklist: true for those machines
<kparal> I can ask tflink about that, I have no idea why it is setup that way
<puiterwijk> ack
<kparal> puiterwijk: what exactly did you do to keep NM out of dns? I don't see the config file changed
<puiterwijk> kparal: for now I've set the /etc/sysconfig/network-scripts/ifcfg-eth0 to include the search domain and both DNS servers
<puiterwijk> So that NM generates a correct resolv.conf
<kparal> ok

@tflink, we need your input here, thanks.

Metadata Update from @kparal:
- Issue priority set to: High
- Issue tagged with: infra

6 years ago

I don't remember why that is set, honestly.

From the git logs, it looks like it was set to work around some network device naming issue about 1.5 years ago:

https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/inventory/host_vars/taskotron01.qa.fedoraproject.org?id=b09e7b0fd3b6d86b4f8dc16e82fb9ae9134968a6

Metadata Update from @kparal:
- Issue assigned to kparal
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 years ago
