Issue #123: Poor support for multiple VPNs (should we enable systemd-resolved?) - fedora-workstation

Closed: Fixed 3 years ago by catanzaro. Opened 4 years ago by catanzaro.

I had major difficulty last week accessing some VPN-protected resources, even though my VPN was successfully connected. Eventually I tracked the problem down to NetworkManager not properly handling DNS when two VPNs are connected at the same time. It's actually a documented behavior of NetworkManager. What happens is this: whichever VPN is connected first gets all the DNS queries, even if one is set to "Use this connection only for resources on its network." But what users expect to happen is this: DNS queries use the nameserver associated with a VPN only if the request is for that VPN's network.

So, for example, I have VPN A, which autoconnects and has "Use this connection only for resources on its network" unchecked. Then I have VPN B, which I manually connect to second, with "Use this connection only for resources on its network" checked, connecting to example.com. If I try to resolve subdomain.example.com, that query needs to be sent to the DNS server specified by VPN B. But currently, it goes to the DNS server specified by VPN A, which returns a negative result for resources on B's internal network, and the connection fails.

Furthermore, a complicating factor is that users expect that if a VPN is configured without "Use this connection only for resources on its network," then DNS queries will never be sent to a fallback nameserver. That would be a "DNS leak," a serious privacy violation. For example, say I connect to VPN B first, then VPN A, in order to successfully access resources on the B network. Then all of my DNS requests go to VPN B's DNS server, which is a DNS leak because the user expects them to be going to VPN A instead (unless they are for example.com). That's not OK either.
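To make the expected behavior concrete, here is a minimal Python sketch of routing a DNS query to a link by longest matching domain suffix. It only illustrates the behavior described above and is not NetworkManager's or systemd-resolved's actual code; the `Link` class, the `pick_link` function, and the nameserver addresses are hypothetical names and values invented for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Link:
    """A network link, its DNS server, and the domains it should answer for."""
    name: str
    nameserver: str               # hypothetical address, for illustration only
    domains: Sequence[str] = ()   # e.g. ("example.com",) for a split-tunnel VPN
    default_route: bool = False   # True if the link should receive all other queries


def pick_link(hostname: str, links: Sequence[Link]) -> Optional[Link]:
    """Return the link whose domain is the longest suffix match for hostname.

    If no domain matches, fall back to the first link marked default_route.
    Connection order only matters when several links are marked default_route.
    """
    labels = hostname.rstrip(".").lower().split(".")
    best: Optional[Link] = None
    best_len = 0
    for link in links:
        for domain in link.domains:
            d = domain.rstrip(".").lower().split(".")
            # Compare whole labels so "notexample.com" does not match "example.com".
            if len(d) <= len(labels) and labels[-len(d):] == d and len(d) > best_len:
                best, best_len = link, len(d)
    if best is not None:
        return best
    return next((link for link in links if link.default_route), None)


# VPN A: "only for resources on its network" unchecked, so it takes everything else.
vpn_a = Link("vpn-a", "10.0.0.1", default_route=True)
# VPN B: "only for resources on its network" checked, responsible for example.com.
vpn_b = Link("vpn-b", "10.8.0.1", domains=("example.com",))
```

With these two links, `pick_link("subdomain.example.com", ...)` selects VPN B whichever VPN was connected first, and every other name goes to VPN A rather than to VPN B or a fallback nameserver.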

According to the NetworkManager documentation, I think we can fix both problems by configuring NetworkManager to use dnsmasq or systemd-resolved for name resolution instead, of which the latter seems a lot more plausible. And NetworkManager will automatically use systemd-resolved if the service is running. I've tested and confirmed systemd-resolved seems to resolve (heh) both of my problems. So: I wonder, why are we not using it? I assume there are good reasons we have this turned off currently. I'd like to know what they are. And unless they are very compelling reasons, it's probably time to turn it on.

I'm playing with systemd-resolved currently and I notice it has Cloudflare and Google DNS hardcoded as fallbacks. Inspecting resolvectl status, I'm fairly confident it handles these correctly, using them only for the non-tun interfaces. So we should be confident that DNS leaks won't happen.

To enable resolved, we would have to do more than change the preset. We'd also have to change /etc/nsswitch.conf and symlink /etc/resolv.conf to /run/systemd/resolve/stub-resolv.conf, documented here. So this would be a significant system-wide change.
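As a rough illustration of the two /etc changes mentioned above, the following Python sketch checks whether a system is already set up that way. It is read-only and makes assumptions not stated in this issue: that the NSS module is listed as `resolve` on the `hosts:` line (the nss-resolve module name from systemd's documentation), and the helper names `hosts_uses_resolve` and `resolv_conf_is_stub` are invented for this example.

```python
import os

# Path named in this issue as the symlink target for /etc/resolv.conf.
STUB_RESOLV_CONF = "/run/systemd/resolve/stub-resolv.conf"


def hosts_uses_resolve(nsswitch_text: str) -> bool:
    """Return True if the 'hosts:' line of nsswitch.conf lists the 'resolve' module."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if line.startswith("hosts:"):
            # Assumption: the module appears as the bare token "resolve".
            return "resolve" in line[len("hosts:"):].split()
    return False


def resolv_conf_is_stub(path: str = "/etc/resolv.conf",
                        stub: str = STUB_RESOLV_CONF) -> bool:
    """Return True if path is a symlink that resolves to the stub file."""
    return os.path.islink(path) and os.path.realpath(path) == os.path.realpath(stub)
```

For example, `hosts_uses_resolve(open("/etc/nsswitch.conf").read())` and `resolv_conf_is_stub()` would both need to be true, in addition to the service being enabled, before the system matches the setup described here.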

Relevant blog post

CC @thaller for the NetworkManager side of this, and @zbyszek for systemd

ngompa commented 4 years ago

It would make sense for us to default to dnsmasq backend. Not only because I think it's a better local resolver (I've been using it on my workstation since Fedora 25), but because it also is a requirement for the CodeReady Containers thing for running OpenShift/OKD locally. Having it enabled by default in Fedora for NetworkManager would vastly simplify setup for things like crc and other similar things that require NetworkManager to work with dnsmasq.

thaller commented 4 years ago

According to the NetworkManager documentation, I think we can fix both problems by configuring NetworkManager to use dnsmasq or systemd-resolved for name resolution instead, of which the latter seems a lot more plausible.

I agree, systemd-resolved seems the much better choice. In principle, both dnsmasq and systemd-resolved act as a central service for managing name resolution. In practice, the (D-Bus) API of systemd-resolved is much better suited.

Most importantly, with the dnsmasq DNS plugin, NetworkManager spawns a dnsmasq process but it still very much has the notion that the entire name resolution is fully controlled by NetworkManager. It's a simple way to get a split DNS setup. But it is rather difficult to inject your own DNS configuration into the dnsmasq service run by NetworkManager (without configuring it via connection profiles in NetworkManager). On the other hand, systemd-resolved runs as a central service and can accept configuration from various independent services (e.g. NetworkManager + networkd + openvpn + your own script can all cooperate).

Also, with connectivity checking enabled in NetworkManager, NetworkManager should resolve the hostname of the check URL per-device. Only systemd-resolved provides an API for doing so (ResolveHostname()). So, if you don't have systemd-resolved installed, there are quite a few limitations. Note that NetworkManager can also use systemd-resolved without having it as the "central" DNS plugin. See main.systemd-resolved in man NetworkManager.conf.

So: I wonder, why are we not using it? I assume there are good reasons we have this turned off currently. I'd like to know what they are.

I think only for historical reasons. We never switched, and nobody did the work or invested enough effort (as you are doing now). AFAIU, Ubuntu has used systemd-resolved for a long time now.

but because it also is a requirement for the CodeReady Containers thing for running OpenShift/OKD locally

But doesn't OpenShift also run dnsmasq as a DHCP service? Also, isn't that an independent dnsmasq instance exclusively controlled by OpenShift?

Note that with the dnsmasq plugin, it is NetworkManager that spawns the process. With systemd-resolved, the service runs as an independent (D-Bus activated) service. That is much better because it decouples the lifetimes of the two services. Of course, in theory that would be fixable. In practice, OpenShift cannot reuse the dnsmasq instance run by NetworkManager.

I think dnsmasq is a very nice project, but it does many other things (DHCP) and does not really have the same use-case as systemd-resolved. I think systemd-resolved is the way to go. It provides unique and compelling features that I'd like to see.

ngompa commented 4 years ago

But doesn't OpenShift also run dnsmasq as a DHCP service? Also, isn't that an independent dnsmasq instance exclusively controlled by OpenShift?

It does not. It uses the one configured with NetworkManager. It adds a dnsmasq configuration snippet and then tells NM to use dnsmasq for DNS. It does not control it beyond that.

ngompa commented 4 years ago

To be clear, crc uses dnsmasq because it can configure libvirt to use dnsmasq to provide DHCP on a virtual network it creates, and because it can tell NetworkManager to use it for DNS. This lets it simulate the split-horizon setup that production OpenShift deployments require.

catanzaro commented 4 years ago

On the other hand, systemd-resolved runs as a central service and can accept configuration from various independent services (e.g. NetworkManager + networkd + openvpn + your own script can all cooperate).

I thought NetworkManager and networkd were mutually exclusive?

catanzaro commented 4 years ago

I think only for historical reasons. We never switched, and nobody did the work or invested enough effort (as you are doing now). AFAIU, Ubuntu has used systemd-resolved for a long time now.

Internet says Ubuntu has used it since 16.10. So we're over three years behind. Dropped the ball here, I guess....

chrismurphy commented 4 years ago

How about raising it on devel@ for wider discussion? Is this something we can just change in Fedora 32? Or would it be for Fedora 33?

catanzaro commented 4 years ago

This would be for F33.

I might just submit a change proposal (which will trigger discussion on devel@) if the WG has no objections.

Implementation would be:

- Edit default /etc/nsswitch.conf provided by glibc
- Edit nsswitch template provided by authselect
- Symlink /etc/resolv.conf -> /run/systemd/resolve/stub-resolv.conf
- Change preset to enable the service

The problem is that the preset change will affect upgrades, so systemd-resolved will be running on upgraded systems, but it won't work there: all of the other changes are to files in /etc, so upgraded systems will only wind up with .rpmnew files. In other words, the service will be running uselessly on upgraded systems. I guess devel@ might have suggestions for how to handle that.

chrismurphy commented 4 years ago

I'm strongly biased toward stomping on existing settings to compel the upgrade to the new way of things, as much as possible. The alternative is upgraded systems increasingly depart from clean installs, and become more non-deterministic in subsequent upgrades.

Two relevant items:
Fedora QA release criterion for upgrades: ...must be possible to successfully complete a direct upgrade from a fully updated, clean default installation of each of the last two stable Fedora releases... [1]

Emphasis on clean installation. If the bug only happens when upgrading a system that is itself an upgrade, rather than an up-to-date clean install, it's not a blocker bug.

Fedora Workstation PRD: ...upgrade process should give a result that is the same as an original install of Fedora Workstation... [2] Yes, this is a legacy document and it doesn't bind us. But it's consistent with the upgrade release criterion. The more fragmentation happens, the more likely we'll see bugs that we don't block on.

[1] https://fedoraproject.org/wiki/Fedora_32_Beta_Release_Criteria#Upgrade_requirements
[2] https://fedoraproject.org/wiki/Workstation/Workstation_PRD

Metadata Update from @catanzaro:
- Issue tagged with: meeting

4 years ago

chrismurphy commented 4 years ago

Is there a summary and proposal for this issue? I'd like to put it on the agenda but I'm not certain it's ready for discussion or decision making. I think there's more to this than just enabling resolved by default?

Anecdote:
I haven't tried enabling systemd-resolved on Fedora; but it's the default on Arch. I just plugged in a Pi with Arch, that hasn't been updated since Oct, no VPN stuff configured at all. Trying to update it, all mirrors failed; and the journal has over 400 seconds of systemd-resolved complaints related to DNSSEC. It took a while to figure this out, but came down to systemctl stop systemd-resolved and then I was able to update. Following the reboot, systemd-resolved is not complaining, and arch repo mirrors are still accessible. So my concern is, something resolved needs could become stale on Fedora Live media over its ~13 month life span, and because resolved is enabled by default, the user can't get updates. I don't know enough about it to say one way or another, but what I just experienced on Arch I would not want Fedora users to experience.

catanzaro commented 4 years ago

Is there a summary and proposal for this issue? I'd like to put it on the agenda but I'm not certain it's ready for discussion or decision making. I think there's more to this than just enabling resolved by default?

No, that's all there is to it. Well, we also need to discuss upgrade path. If the WG wants to enable systemd-resolved, I will handle preparing a change proposal and implementing the change.

I haven't tried enabling systemd-resolved on Fedora; but it's the default on Arch. I just plugged in a Pi with Arch, that hasn't been updated since Oct, no VPN stuff configured at all. Trying to update it, all mirrors failed; and the journal has over 400 seconds of systemd-resolved complaints related to DNSSEC. It took a while to figure this out, but came down to systemctl stop systemd-resolved and then I was able to update. Following the reboot, systemd-resolved is not complaining, and arch repo mirrors are still accessible. So my concern is, something resolved needs could become stale on Fedora Live media over its ~13 month life span, and because resolved is enabled by default, the user can't get updates. I don't know enough about it to say one way or another, but what I just experienced on Arch I would not want Fedora users to experience.

This, though, is beyond me to help with. I'm certainly not an expert in systemd-resolved, nor do I know what might have happened here.

FWIW I'm currently using systemd-resolved on Fedora and it's working very nicely, unlike our broken default configuration.

thaller commented 4 years ago

for the record:

From NetworkManager side, we are very much in favour of such a change (i.e. having systemd-resolved as the default solution in future Fedora).

catanzaro commented 4 years ago

I haven't tried enabling systemd-resolved on Fedora; but it's the default on Arch. I just plugged in a Pi with Arch, that hasn't been updated since Oct, no VPN stuff configured at all. Trying to update it, all mirrors failed; and the journal has over 400 seconds of systemd-resolved complaints related to DNSSEC. It took a while to figure this out, but came down to systemctl stop systemd-resolved and then I was able to update. Following the reboot, systemd-resolved is not complaining, and arch repo mirrors are still accessible. So my concern is, something resolved needs could become stale on Fedora Live media over its ~13 month life span, and because resolved is enabled by default, the user can't get updates. I don't know enough about it to say one way or another, but what I just experienced on Arch I would not want Fedora users to experience.

I checked for possibly-related bugs and found a messy bugtracker. Can you report a new bug please, or select an existing one to piggyback onto? If systemd-resolved is not working properly for a WG member, that probably merits blocking on the problem, but there needs to be a bug report. :)

Edited 4 years ago by catanzaro

chrismurphy commented 4 years ago

I don't think this is a blocking problem per se. It is working now (since updating). Since I don't understand how resolved works, I'm not sure if the update fixed a bug, or if some peripheral resource (like a database of domains) was updated, and that's what fixed it. But yes, the referenced issue does describe what I was experiencing. Seems likely updates weren't working because any .org was affected, and archlinux repo metadata was looking for a .org domain.

catanzaro commented 4 years ago

Agreed: I will submit an F33 change proposal for this.

(We'll have to address the question of whether to use dnsmasq instead.)

Metadata Update from @catanzaro:
- Issue untagged with: meeting

4 years ago

Metadata Update from @catanzaro:
- Issue assigned to catanzaro

4 years ago

Metadata Update from @chrismurphy:
- Issue tagged with: meeting

4 years ago

Metadata Update from @chrismurphy:
- Issue set to the milestone: Fedora 33

4 years ago