#78 Design document: Kerberos locator fail over
Closed 5 years ago by jhrozek. Opened 5 years ago by jhrozek.
SSSD/ jhrozek/docs libkrb5_failover  into  master

file modified
+1
@@ -29,6 +29,7 @@ 

  .. toctree::

     :maxdepth: 1

  

+    kdcinfo_multiple_servers

     auto_private_groups

     hybrid_private_groups

     uid_negative_global_catalog

@@ -0,0 +1,272 @@ 

+ .. highlight:: none

+ 

+ Multiple server addresses or names in kdcinfo files

+ ===================================================

+ 

+ Related ticket(s):

+ ------------------

+    * TBD

+ 

+ 

+ Problem statement

+ -----------------

+ When a user authenticates using Kerberos, the KDCs that will actually be

+ used are either discovered by libkrb5 with the help of DNS SRV records,

+ or the KDCs are configured explicitly in ``/etc/krb5.conf`` or provided

+ by a special `locator plugin`.

+ 

+ Because the administrator expects that the servers they defined in

+ ``sssd.conf`` would be used for both authentication through SSSD and by

+ applications that use libkrb5, such as the Kerberos command line tools like

+ ``kinit``, SSSD provides a locator plugin for libkrb5 that allows SSSD to

+ inform libkrb5 about the servers SSSD had configured.

+ 

+ However, SSSD, at least in the typical use case, only writes the information

+ about the single server it connects to and changes the address only when

+ the daemon reconnects to a different server. This creates a problem in case

+ the server whose address is written in the kdcinfo file is unreachable

+ but no action towards sssd that would provoke a fail over (such as a

+ user login over PAM) is executed. In that case, the kdcinfo file contains

+ stale entries. Because libkrb5 treats the kdcinfo files as authoritative,

+ it cannot reach any KDC from that domain if the information in those

+ files is not usable.

+ 

+ To improve the situation, this design page proposes adding a new sssd option

+ that, if set, would enable sssd to write additional host names into the

+ kdcinfo files which would then allow the plugin to iterate over these

+ items and in turn give libkrb5 a basic form of fail over for entries

+ configured in sssd.conf or autodiscovered by SSSD.

+ 

+ Use cases

+ ---------

+ A typical sequence that triggers this problem is this:

+    * log in with a PAM service to a machine. This causes a KDC address to

+      be written to the kdcinfo file

+    * disable the KDC server, e.g. by enabling a restrictive firewall rule

+    * call kinit on the client where the kdcinfo file was written

+ 

+ Overview of the solution

+ ------------------------

+ The Kerberos locator plugin reads the address(es) from per-realm text files

+ written by SSSD located in the ``/var/lib/sss/pubconf`` directory. At the

+ moment, the plugin can already read multiple entries, but currently only

+ numerical addresses are supported.

+ 

+ On a high level, implementing this RFE requires several changes:

+    * change the Kerberos locator plugin so that it can also consume

+      host names in addition to numerical addresses. These host names

+      would be resolved in the plugin itself and passed to libkrb5 with

+      the help of a callback function libkrb5 provides to the plugin

+    * add a new SSSD option that would limit the number of entries that

+      SSSD writes to the kdcinfo files. This is needed to avoid

+      timeouts in case the network was truly unreachable. The default value

+      of the option could perhaps be different in master and sssd-1-16

+      where master could default to writing multiple entries, but

+      sssd-1-16 would default the option to 0 in order to not change

+      behaviour of a stable branch.

+    * extend the online callback which the SSSD fail over component uses

+      to write the current server to the kdcinfo files to also write

+      additional server host names in addition to the current server address

+    * to enable writing multiple server addresses, the request to resolve

+      a server for a service should be extended to resolve host names

+      up to the specified limit

+ 

+ When it comes to resolving the servers, there are several scenarios to

+ consider:

+ 

+    * The servers can be enumerated using an option. This includes

+      ``krb5_server/krb5_backup_server`` for the krb5 provider and

+      ``ipa_server/ipa_backup_server`` and ``ad_server/ad_backup_server``

+      for the IPA and AD providers.

+    * The servers can be completely autodiscovered. Typically this is

+      done by either omitting the ``*_server`` options completely or

+      using the ``_srv_`` identifier. As long as the list is omitted

+      or the ``_srv_`` record is the first one in the list, any fail

+      over service resolution would trigger the DNS SRV lookups and

+      resolve the whole list. It is useful to note that the ``_srv_``

+      identifier is not permitted in the backup server list explicitly,

+      but the AD provider does resolve a SRV query into the backup

+      server list. This happens when an AD site is used: the servers

+      from the AD site are added as 'primary' and the global servers

+      form the 'backup' list.

+    * A mix of the above. The most complex case from the point of

+      this RFE is a list that starts with a host name, but includes

+      the ``_srv_`` identifier later on, e.g. ``krb5_server = kdc.example.com,

+      _srv_``. In this case, currently calling the fail over resolution

+      would only resolve the host name of ``kdc.example.com``, but not

+      the SRV query, so unless the fail over code is extended, the

+      host names originating from the SRV query would not be known

+      after the service resolution finishes.
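
The three scenarios above can be told apart by looking at where the ``_srv_`` identifier sits in the configured list. The following is only an illustrative sketch: the function name and the three labels are hypothetical and are not part of SSSD.

```python
SRV_IDENTIFIER = "_srv_"


def classify_server_list(option_value):
    """Classify a ``*_server`` option value (hypothetical helper).

    Returns one of three illustrative labels:
      "srv_first"  - list omitted or starting with _srv_, so any service
                     resolution triggers the DNS SRV lookup right away
      "srv_later"  - starts with a host name but contains _srv_ later on
      "enumerated" - host names only, no SRV lookup involved
    """
    entries = [e.strip() for e in (option_value or "").split(",") if e.strip()]
    if not entries or entries[0] == SRV_IDENTIFIER:
        return "srv_first"
    if SRV_IDENTIFIER in entries:
        return "srv_later"
    return "enumerated"
```

Only the ``srv_later`` case needs extra work in the fail over code, because the SRV query is not resolved as a side effect of resolving the first host name.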

+ 

+ Implementation details

+ ----------------------

+ The interface the locator plugin uses to communicate with libkrb5 is a

+ callback function provided by the caller (libkrb5); SSSD is supposed

+ to pass a struct sockaddr to the caller. The Kerberos locator plugin

+ is already capable of iterating over multiple addresses, but currently

+ only numerical addresses are supported and the plugin converts

+ the string representation of the address into struct sockaddr by calling

+ ``getaddrinfo(3)`` with the ``AI_NUMERICHOST`` parameter. We should extend

+ the locator plugin code by calling getaddrinfo for entries that do not

+ represent an address to resolve a host name and pass its address. This

+ can be a first self-contained step in the implementation.
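
The two-step lookup described above (numerical address first, host name as a fallback) can be sketched as follows. The real plugin is written in C; this Python version only mirrors the ``getaddrinfo`` logic, and the function names and the default port constant are assumptions made for the example.

```python
import socket

KDC_PORT = 88  # standard Kerberos KDC port, used here as an assumed default


def is_numeric_address(entry):
    """Return True if the entry parses as a numerical IPv4/IPv6 address."""
    try:
        # AI_NUMERICHOST forbids any name resolution, so this never blocks.
        socket.getaddrinfo(entry, None, flags=socket.AI_NUMERICHOST)
    except socket.gaierror:
        return False
    return True


def entry_to_sockaddrs(entry, port=KDC_PORT):
    """Turn one kdcinfo entry into a list of (family, sockaddr) tuples."""
    # Numerical entries keep AI_NUMERICHOST; anything else is treated as a
    # host name and resolved with a potentially blocking lookup.
    flags = socket.AI_NUMERICHOST if is_numeric_address(entry) else 0
    try:
        infos = socket.getaddrinfo(entry, port, type=socket.SOCK_STREAM,
                                   flags=flags)
    except socket.gaierror:
        return []  # unresolvable entry: the caller moves on to the next one
    return [(family, sockaddr) for family, _, _, _, sockaddr in infos]
```

Each returned socket address would then be handed to the callback that libkrb5 provides, one call per address.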

+ 

+ The kdcinfo files are written (using ``write_krb5info_file``) either

+ during an online callback or in a special-case for IPA trust clients. The

+ special case is already doing something similar to what this page

+ is about by looking into a subsection representing a trusted domain

+ (e.g. ``[domain/ipa.test/win.trust.test]``) and resolving all the servers

+ in that list either by name or based on a site selection. However, this

+ is done during the subdomain provider operation, not during a resolver

+ callback and all the addresses configured in the ``sssd.conf`` file are

+ always resolved and written to the kdcinfo file.

+ 

+ The ``write_krb5info_file`` receives a linked list of ``struct fo_server``

+ structures which contains the address, if already resolved, or at least

+ a host name in the ``struct server_common`` member structure. Since the

+ callback should already be synchronous and not do much work on its own, it

+ would be best if the callback was already invoked with the data provided.

+ 

+ There are two kinds of servers in the fail over module - primary and

+ backup.  The backup servers are supposed to only be used temporarily

+ and sssd periodically tries to connect to one of the primary servers.

+ However, from the fail over code point of view, even adding a "backup"

+ server still means the server is added to the same linked list, just with

+ a flag denoting that the server is not primary. Therefore, iterating over

+ a single list would iterate over both the primary and backup servers.

+ 

+ Before changing the online callbacks, it would be useful to implement and

+ read the ``krb5_kdcinfo_lookahead`` option so that there is already an

+ upper limit when the callbacks write the extra host names.

+ 

+ The next step of implementation could be extending the online

+ callbacks that call the ``write_krb5info_file`` functions. There are

+ several of them, ``ad_resolve_callback``, ``ipa_resolve_callback``

+ and ``krb5_resolve_callback``. The callbacks receive the current

+ ``struct fo_server`` instance. The callbacks would then keep iterating

+ over the linked list until either the list is exhausted or as many as

+ ``krb5_kdcinfo_lookahead`` items are processed. The host name from the

+ ``struct server_common`` structure would be read using ``fo_get_server_name``

+ and written to the array passed to ``write_krb5info_file``.
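
A minimal sketch of that iteration, assuming the lookahead value counts the extra host names written after the current server (so 0 keeps today's behaviour). ``FoServer`` is a simplified stand-in for ``struct fo_server``, and ``collect_kdcinfo_entries`` is a hypothetical name, not an SSSD function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FoServer:
    """Simplified stand-in for one node of the fail over linked list."""
    name: str
    primary: bool = True
    next: Optional["FoServer"] = None


def collect_kdcinfo_entries(current, current_address, lookahead):
    """Build the entries to write: current address plus following names."""
    entries = [current_address]
    server = current.next
    extra = 0
    # Stop when the list is exhausted or the lookahead limit is reached.
    while server is not None and extra < lookahead:
        entries.append(server.name)
        extra += 1
        server = server.next
    return entries
```

Because primary and backup servers share one list, the loop picks up backup servers without any special handling.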

+ 

+ One question to consider is whether to use the ``fo_server`` instances before

+ the current one, i.e. those that SSSD tried before and couldn't connect to.

+ I think it would make sense to add them to the end of the list, at least

+ for the primary servers not from a SRV query, because sssd never reconnects

+ to a server earlier in the list as long as a later server works. The SRV queries

+ are different in this respect in the sense that they time out and force

+ SSSD to resolve the whole list once a server is requested again (typically

+ either during authentication or once the LDAP connection expires).
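
If previously tried servers are appended, the written order is a rotation of the configured list that starts at the current server. The sketch below works on a plain Python list instead of the linked list, and the function name is hypothetical; it again assumes the lookahead counts entries after the current one.

```python
def order_for_kdcinfo(servers, current_index, lookahead):
    """Rotate the list so the current server comes first, then truncate."""
    # Servers after the current one keep their order; the ones SSSD already
    # tried (those before the current index) are moved to the end.
    rotated = servers[current_index:] + servers[:current_index]
    return rotated[: lookahead + 1]
```

With four configured servers and the third one currently in use, a lookahead of 3 would yield the third, fourth, first and second server in that order.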

+ 

+ Finally, the case where the fail over code needs to do additional lookups

+ in order to resolve at least the amount of host names requested by the

+ ``krb5_kdcinfo_lookahead`` should be addressed. The caller that initializes

+ the fail over service (maybe with ``be_fo_add_service``) should provide

+ a hint with the value of the lookahead option. Then, if a request for

+ server resolution is triggered, the fail over code would resolve a server

+ and afterwards check if there are enough ``fo_server`` entries with a valid hostname

+ in the ``struct server_common`` structure. If not, the request would

+ check if any of the ``fo_server`` structures represents a SRV query and

+ try to resolve the query to receive more host names.
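
The check described above could look roughly like this. It is a sketch under assumptions: the server list is modelled as a list of strings, ``srv_lookup`` is a caller-supplied callable standing in for the DNS SRV query, and ``expand_server_list`` is a hypothetical name.

```python
SRV_IDENTIFIER = "_srv_"


def expand_server_list(servers, wanted, srv_lookup):
    """Resolve SRV placeholders only if too few host names are known."""
    known = [s for s in servers if s != SRV_IDENTIFIER]
    if len(known) >= wanted or SRV_IDENTIFIER not in servers:
        return list(servers)  # enough names already, no extra lookup

    expanded = []
    for entry in servers:
        if entry != SRV_IDENTIFIER:
            expanded.append(entry)
            continue
        # Replace the placeholder with the discovered names, skipping any
        # host that is already configured explicitly or already added.
        for host in srv_lookup():
            if host not in known and host not in expanded:
                expanded.append(host)
    return expanded
```

Here ``wanted`` would be derived from the lookahead hint passed when the fail over service is initialized.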

+ 

+ Configuration changes

+ ---------------------

+ A new configuration option called ``krb5_kdcinfo_lookahead`` would be added.

+ This option would default to a sensible non-zero value in the master

+ branch, perhaps 3 so that attempting to resolve the extra host names does

+ not cause the libkrb5 operation to time out. If the patches are backported

+ to any stable branch, the option must default to 0 (disabled).

+ 

+ In the first iteration, we might want to just read a single number, but

+ in the future, the option should be extended to accept two numbers in the

+ ``total:backup`` notation. This would mean write up to ``total`` servers,

+ but include up to ``backup`` servers from the backup list. This would be

+ useful in case none of the servers from the primary list are reachable,

+ because e.g. they all come from the same AD site, but servers outside the

+ site are reachable. This extension would only make sense if SSSD does not

+ resolve the host names on its own, which might be another future extension.
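
One possible reading of the ``total:backup`` notation is sketched below. Both function names are hypothetical, and the selection policy (reserving up to ``backup`` slots for backup servers) is an assumption, since the text does not fix the exact semantics.

```python
def parse_lookahead(value):
    """Parse "N" or "total:backup" into (total, backup-or-None)."""
    total_str, sep, backup_str = value.strip().partition(":")
    total = int(total_str)
    backup = int(backup_str) if sep else None
    if total < 0 or (backup is not None and not 0 <= backup <= total):
        raise ValueError(f"invalid lookahead value: {value!r}")
    return total, backup


def select_servers(primary, backup, total, backup_limit):
    """Pick up to ``total`` servers, reserving slots for backup servers."""
    if backup_limit is None:
        # Single-number form: take servers in list order.
        return (primary + backup)[:total]
    chosen_backup = backup[:min(backup_limit, total)]
    chosen_primary = primary[: total - len(chosen_backup)]
    return chosen_primary + chosen_backup
```

With ``3:1``, three primary servers and two backup servers, this policy would write two primary servers followed by one backup server, so at least one server outside the AD site is always present.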

+ 

+ It might be a good idea to add a note to the ``sssd-ad`` and ``sssd-ipa``

+ man pages or even the shared fail over man page include file with a pointer

+ to how the kdcinfo files work so that the information is easy to discover

+ for administrators.

+ 

+ How To Test

+ -----------

+ Plugin test

+    With any of the below tests or even after writing the host names to

+    the kdcinfo files directly, make sure the first entry in the list is

+    unreachable. Then call e.g. ``kinit`` and check that the operation succeeds.

+ 

+ Backwards compatibility test

+    Set the ``krb5_kdcinfo_lookahead`` option to 0. Define multiple servers

+    and perform Kerberos authentication. Make sure that only the current server

+    is written to the kdcinfo files.

+ 

+ Write a list of servers

+    Set the ``krb5_kdcinfo_lookahead`` option to a positive value. Make sure that the

+    first entry in the kdcinfo files is an address and the other entries are

+    host names from the configuration. This test case should be extended to

+    make sure that only as many entries as the value of the option are written,

+    or, if there are fewer entries in the config file, that all are written.

+ 

+ Fail over test

+    Similar to the above, except make sure the first entry in the list cannot

+    be contacted. Then, SSSD should resolve the next entry to the address

+    and if applicable write the rest of the list.

+ 

+ Backup server test

+    At the minimum, we should make sure that servers from the backup list

+    are written to the kdcinfo files. If the option implements the split

+    ``total:backup`` value, then that should be tested as well.

+ 

+ (Optional) writing a previously tried, not working server

+    If it is agreed during design review that non-working servers are also to

+    be written to the kdcinfo files (see the section about not working

+    servers), then a test case should make sure those

+    are written to the end of the list.

+ 

+ SRV resolution test

+    Leave the server list (e.g. ``krb5_server``) option empty. Make sure

+    a DNS SRV query for the configured realm returns valid servers and

+    they are written to the kdcinfo files.

+ 

+ Combined SRV and server list

+    Set the ``krb5_server`` option to ``hostname, _srv_``. Set the

+    ``krb5_kdcinfo_lookahead`` option to a value greater than 1. Make

+    sure that the host names from the DNS SRV query are also present

+    in the kdcinfo files.

+ 

+ IPA client test

+    The test cases above should be repeated for an IPA client as well in

+    case the IPA online callbacks are modified.

+ 

+ AD site test

+    Add an AD client to a site or set the site in the config file. Make

+    sure that the servers from the site are written first, followed

+    by the global servers up to the ``krb5_kdcinfo_lookahead`` value.

+ 

+ How To Debug

+ ------------

+ Any new code must be decorated with DEBUG messages. To debug the locator

+ plugin changes, using ``KRB5_TRACE`` or even calling ``strace`` might be

+ useful.

+ 

+ Future development

+ ------------------

+ First, it might be useful to extend the resolver or fail over code to resolve

+ the names on its own to save some potentially blocking calls in the plugin.

+ There is already an example of ``resolv_hostport_list_send`` that can perhaps

+ be reused.

+ 

+ Additionally, we have been planning for some time to include connectivity checks

+ with cLDAP ping or just plain ``connect()`` to make sure that servers that

+ cannot be contacted at all are not tried. This is of course outside of the

+ scope of this work, but should be kept in mind to not implement something

+ incompatible.

+ 

+ Authors

+ -------

+  * Sumit Bose <sbose@redhat.com>

+  * Tomas Halman <thalman@redhat.com>

+  * Jakub Hrozek <jhrozek@redhat.com>

Design document about multiple server names or addresses in the kdcinfo files and enabling locator plugin fail over
