From 2ae31748095842d8e85abae2a383f57a402bc882 Mon Sep 17 00:00:00 2001
From: Jakub Hrozek
Date: Feb 28 2019 09:06:13 +0000
Subject: Design document: Kerberos locator fail over

---

diff --git a/design_pages/index.rst b/design_pages/index.rst
index cf457d7..9b74c75 100644
--- a/design_pages/index.rst
+++ b/design_pages/index.rst
@@ -29,6 +29,7 @@ Implemented in 1.16.x
 .. toctree::
    :maxdepth: 1
 
+   kdcinfo_multiple_servers
    auto_private_groups
    hybrid_private_groups
    uid_negative_global_catalog
diff --git a/design_pages/kdcinfo_multiple_servers.rst b/design_pages/kdcinfo_multiple_servers.rst
new file mode 100644
index 0000000..5a90a26
--- /dev/null
+++ b/design_pages/kdcinfo_multiple_servers.rst
@@ -0,0 +1,272 @@
+.. highlight:: none
+
+Multiple server addresses or names in kdcinfo files
+===================================================
+
+Related ticket(s):
+------------------
+ * TBD
+
+
+Problem statement
+-----------------
+When a user authenticates using Kerberos, the KDCs that will actually be
+used are either discovered by libkrb5 with the help of DNS SRV records,
+or the KDCs are configured explicitly in ``/etc/krb5.conf`` or provided
+by a special `locator plugin`.
+
+Because the administrator expects that the servers they defined in
+``sssd.conf`` would be used for both authentication through SSSD and by
+applications that use libkrb5, such as the Kerberos command line tools like
+``kinit``, SSSD provides a locator plugin for libkrb5 that allows SSSD to
+inform libkrb5 about the servers SSSD had configured.
+
+However, SSSD, at least in the typical use case, only writes the information
+about the single server it connects to and changes the address only when
+the daemon reconnects to a different server. This creates a problem in case
+the server whose address is written in the kdcinfo file is unreachable
+but no action towards sssd that would provoke a fail over (such as a
+user login over PAM) is executed.
+In that case, the kdcinfo file contains stale entries. Because the
+kdcinfo files are authoritative from libkrb5's point of view, libkrb5
+cannot reach any KDC from that domain if the information present there
+is not useful.
+
+To improve the situation, this design page proposes adding a new sssd option
+that, if set, would enable sssd to write additional host names into the
+kdcinfo files which would then allow the plugin to iterate over these
+items and in turn allow libkrb5 to have sort of a failover for entries
+configured in sssd.conf or autodiscovered by SSSD.
+
+Use cases
+---------
+A typical sequence that triggers this problem is this:
+ * log in with a PAM service to a machine. This causes a KDC address to
+   be written to the kdcinfo file
+ * disable the KDC server, e.g. by enabling a restrictive firewall rule
+ * call kinit on the client where the kdcinfo file was written
+
+Overview of the solution
+------------------------
+The Kerberos locator plugin reads the address(es) from per-realm text files
+written by SSSD located in the ``/var/lib/sss/pubconf`` directory. At the
+moment, the plugin can already read multiple entries, but currently only
+numerical addresses are supported.
+
+On a high level, implementing this RFE requires several changes:
+ * change the Kerberos locator plugin so that it can also consume
+   host names in addition to numerical addresses. These host names
+   would be resolved in the plugin itself and passed to libkrb5 with
+   the help of a callback function libkrb5 provides to the plugin
+ * add a new SSSD option that would limit the number of entries that
+   SSSD writes to the kdcinfo files. This is needed to avoid
+   timeouts in case the network was truly unreachable. The default value
+   of the option could perhaps be different in master and sssd-1-16
+   where master could default to writing multiple entries, but
+   sssd-1-16 would default the option to 0 in order to not change
+   behaviour of a stable branch.
+ * extend the online callback which the SSSD fail over component uses
+   to write the current server to the kdcinfo files to also write
+   additional server host names in addition to the current server address
+ * to enable writing multiple server addresses, the request to resolve
+   a server for a service should be extended to resolve host names
+   up to the specified limit
+
+When it comes to resolving the servers, there are several scenarios to
+consider:
+
+ * The servers can be enumerated using an option. This includes
+   ``krb5_server/krb5_backup_server`` for the krb5 provider and
+   ``ipa_server/ipa_backup_server`` and ``ad_server/ad_backup_server``
+   for the IPA and AD providers.
+ * The servers can be completely autodiscovered. Typically this is
+   done by either omitting the ``*_server`` options completely or
+   using the ``_srv_`` identifier. As long as the list is omitted
+   or the ``_srv_`` record is the first one in the list, any fail
+   over service resolution would trigger the DNS SRV lookups and
+   resolve the whole list. It is useful to note that the ``_srv_``
+   identifier is not permitted in the backup server list explicitly,
+   but the AD provider does resolve a SRV query into the backup
+   server list. This happens when an AD site is used: the servers
+   from the AD site are added as 'primary' and the global servers
+   form the 'backup' list.
+ * A mix of the above. The most complex case from the point of view of
+   this RFE is a list that starts with a host name, but includes
+   the ``_srv_`` identifier later on, e.g. ``krb5_server = kdc.example.com,
+   _srv_``. In this case, currently calling the fail over resolution
+   would only resolve the host name of ``kdc.example.com``, but not
+   the SRV query, so unless the fail over code is extended, the
+   host names originating from the SRV query would not be known
+   after the service resolution finishes.
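The first change listed in this section, letting the locator plugin accept
host names as well as numerical addresses, could look roughly like the sketch
below. This is only an illustration and not the actual plugin code: the
function names ``entry_is_numeric`` and ``resolve_kdcinfo_entry`` are
hypothetical, and the only library calls assumed are the standard
``getaddrinfo(3)`` and ``freeaddrinfo(3)``. The idea is to try the entry as a
numerical address first, which never touches the network, and to fall back to
a (potentially blocking) name lookup only when that fails.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper, not SSSD code: returns 1 if `entry` is a numerical
 * IPv4 or IPv6 address and 0 otherwise. AI_NUMERICHOST guarantees that
 * getaddrinfo() does not perform any name resolution. */
static int entry_is_numeric(const char *entry)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;

    if (getaddrinfo(entry, NULL, &hints, &res) != 0) {
        return 0;
    }
    freeaddrinfo(res);
    return 1;
}

/* Hypothetical helper, not SSSD code: resolve one kdcinfo entry into a
 * list of socket addresses. Returns 0 on success or an EAI_* error code.
 * On success the caller releases *_res with freeaddrinfo(). */
static int resolve_kdcinfo_entry(const char *entry, const char *port,
                                 struct addrinfo **_res)
{
    struct addrinfo hints;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    /* First attempt: numerical address only, no DNS traffic. */
    hints.ai_flags = AI_NUMERICHOST;
    ret = getaddrinfo(entry, port, &hints, _res);
    if (ret != EAI_NONAME) {
        /* Either success or an error unrelated to the entry's format. */
        return ret;
    }

    /* Second attempt: treat the entry as a host name. This may block
     * until the resolver answers or times out. */
    hints.ai_flags = 0;
    return getaddrinfo(entry, port, &hints, _res);
}
```

A caller would walk the returned list and hand each ``ai_addr`` to the
callback libkrb5 provides. Because the second ``getaddrinfo(3)`` call can
block, the number of host names in a kdcinfo file directly affects how long a
libkrb5 operation may take, which is one reason a limit on the number of
written entries is proposed.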
+
+Implementation details
+----------------------
+The interface the locator plugin uses to communicate with libkrb5 is a
+callback function provided by the caller (libkrb5). SSSD is supposed
+to pass a ``struct sockaddr`` to the caller. The Kerberos locator plugin
+is already capable of iterating over multiple addresses, but currently
+only numerical addresses are supported and the plugin converts
+the string representation of the address into ``struct sockaddr`` by calling
+``getaddrinfo(3)`` with the ``AI_NUMERICHOST`` parameter. We should extend
+the locator plugin code by calling ``getaddrinfo`` for entries that do not
+represent an address to resolve a host name and pass its address. This
+can be a first self-contained step in the implementation.
+
+The kdcinfo files are written (using ``write_krb5info_file``) either
+during an online callback or in a special case for IPA trust clients. The
+special case is already doing something similar to what this page
+is about by looking into a subsection representing a trusted domain
+(e.g. ``[domain/ipa.test/win.trust.test]``) and resolving all the servers
+in that list either by name or based on a site selection. However, this
+is done during the subdomain provider operation, not during a resolver
+callback and all the addresses configured in the ``sssd.conf`` file are
+always resolved and written to the config file.
+
+The ``write_krb5info_file`` function receives a linked list of
+``struct fo_server`` structures which contain the address, if already
+resolved, or at least a host name in the ``struct server_common`` member
+structure. Since the callback should already be synchronous and not do much
+work on its own, it would be best if the callback was already invoked with
+the data provided.
+
+There are two kinds of servers in the fail over module - primary and
+backup. The backup servers are supposed to only be used temporarily
+and sssd periodically tries to connect to one of the primary servers.
+However, from the fail over code point of view, even adding a "backup"
+server still means the server is added to the same linked list, just with
+a flag denoting that the server is not primary. Therefore, iterating over
+a single list would iterate over both the primary and backup servers.
+
+Before changing the online callbacks, it would be useful to implement and
+read the ``krb5_kdcinfo_lookahead`` option so that there is already an
+upper limit when the callbacks write the extra host names.
+
+The next step of implementation could be extending the online
+callbacks that call the ``write_krb5info_file`` function. There are
+several of them: ``ad_resolve_callback``, ``ipa_resolve_callback``
+and ``krb5_resolve_callback``. The callbacks receive the current
+``struct fo_server`` instance. The callbacks would then keep iterating
+over the linked list until either the list is exhausted or as many as
+``krb5_kdcinfo_lookahead`` items are processed. The host name from the
+``struct server_common`` structure would be read using ``fo_get_server_name``
+and written to the array passed to ``write_krb5info_file``.
+
+One question to consider is whether to use the ``fo_server`` instances before
+the current one, i.e. those that SSSD tried before and couldn't connect to.
+I think it would make sense to add them to the end of the list, at least
+for the primary servers not from a SRV query, because sssd never reconnects
+to a server earlier in the list as long as a later server works. The SRV
+queries are different in this respect in the sense that they time out and
+force SSSD to resolve the whole list once a server is requested again
+(typically either during authentication or once the LDAP connection expires).
+
+Finally, the case where the fail over code needs to do additional lookups
+in order to resolve at least the number of host names requested by the
+``krb5_kdcinfo_lookahead`` option should be addressed.
+The caller that initializes
+the fail over service (maybe with ``be_fo_add_service``) should provide
+a hint with the value of the lookahead option. Then, if a request for
+server resolution is triggered, the fail over code would resolve a server
+and afterwards check if enough ``fo_server`` entries with a valid host name
+in the ``struct server_common`` structure exist. If not, the request would
+check if any of the ``fo_server`` structures represents a SRV query and
+try to resolve the query to receive more host names.
+
+Configuration changes
+---------------------
+A new configuration option called ``krb5_kdcinfo_lookahead`` would be added.
+This option would default to a sensible non-zero value in the master
+branch, perhaps 3, so that attempting to resolve the extra host names does
+not cause the libkrb5 operation to time out. If the patches are backported
+to any stable branch, the option must default to 0 (disabled).
+
+In the first iteration, we might want to just read a single number, but
+in the future, the option should be extended to accept two numbers in the
+``total:backup`` notation. This would mean write up to ``total`` servers,
+but include up to ``backup`` servers from the backup list. This would be
+useful in case none of the servers from the primary list are reachable,
+because e.g. they all come from the same AD site, but servers outside the
+site are reachable. This extension would only make sense if SSSD does not
+resolve the host names on its own, which might be another future extension.
+
+It might be a good idea to add a note to the ``sssd-ad`` and ``sssd-ipa``
+man pages or even the shared fail over man page include file with a pointer
+to how the kdcinfo files work so that the information is easy to discover
+for administrators.
+
+How To Test
+-----------
+Plugin test
+    With any of the below tests or even after writing the host names to
+    the kdcinfo files directly, make sure the first entry in the list is
+    unreachable. Then call e.g.
+    ``kinit`` and check that the operation succeeds.
+
+Backwards compatibility test
+    Set the ``krb5_kdcinfo_lookahead`` option to 0. Define multiple servers
+    and perform Kerberos authentication. Make sure that only the current server
+    is written to the kdcinfo files.
+
+Write a list of servers
+    Set the ``krb5_kdcinfo_lookahead`` option to a positive value. Make sure
+    that the first entry in the kdcinfo files is an address and the other
+    entries are host names from the configuration. This test case should be
+    extended to make sure only as many entries as the value of the option are
+    written, or if there are fewer entries in the config file, all are written.
+
+Fail over test
+    Similar to the above, except make sure the first entry in the list cannot
+    be contacted. Then, SSSD should resolve the next entry to the address
+    and if applicable write the rest of the list.
+
+Backup server test
+    At the minimum, we should make sure that servers from the backup list
+    are written to the kdcinfo files. If the option would implement the split
+    ``total:backup`` value, then those should be tested as well.
+
+(Optional) writing a previously tried, not working server
+    If it is agreed during design review that also not working servers are to
+    be written to the kdcinfo files (see the section about not working
+    servers), then a test case should make sure those
+    are written to the end of the list.
+
+SRV resolution test
+    Leave the server list (e.g. ``krb5_server``) option empty. Make sure
+    a DNS SRV query for the configured realm returns valid servers and
+    they are written to the config file.
+
+Combined SRV and server list
+    Set the ``krb5_server`` option to ``hostname, _srv_``. Set the
+    ``krb5_kdcinfo_lookahead`` option to a value greater than 1. Make
+    sure that the host names from the DNS SRV query are also present
+    in the kdcinfo files.
+
+IPA client test
+    The test cases above should be repeated for an IPA client as well in
+    case the IPA online callbacks are modified.
+
+AD site test
+    Add an AD client to a site or set the site in the config file. Make
+    sure that the servers from the site are written first, followed
+    by the global servers up to the ``krb5_kdcinfo_lookahead`` value.
+
+How To Debug
+------------
+Any new code must be decorated with DEBUG messages. To debug the locator
+plugin changes, using ``KRB5_TRACE`` or even calling ``strace`` might be
+useful.
+
+Future development
+------------------
+First, it might be useful to extend the resolver or fail over code to resolve
+the names on its own to save some potentially blocking calls in the plugin.
+There is already an example of ``resolv_hostport_list_send`` that can perhaps
+be reused.
+
+Additionally, we have been planning for some time to include connectivity
+checks with cLDAP ping or just plain ``connect()`` to make sure that servers
+that cannot be contacted at all are not tried. This is of course outside of
+the scope of this work, but should be kept in mind to not implement something
+incompatible.
+
+Authors
+-------
+ * Sumit Bose
+ * Tomas Halman
+ * Jakub Hrozek