#8544 After reboot: Replication bind with GSSAPI auth failed
Closed: fixed 3 months ago by abbra. Opened a year ago by cheimes.

Issue

Replication keeps failing with GSSAPI auth failures after an IPA server has been rebooted; even after 30 minutes it has not recovered. The test cluster has a non-trivial amount of data (5k users, 4k hosts, lots of groups + hostgroups + sudo + HBAC) and the VMs have rather slow I/O.

The root cause of the initial GSSAPI auth problem may be a clock issue on the test system. After 389-DS has started and shortly after the KDC is up, chronyd steps the clock back by about 2 hours. I assume that this clock skew is causing the auth issue: the skew is far larger than the roughly 5 minutes Kerberos tolerates by default, so a remote KDC/service on another server does not accept the ticket.

Steps to Reproduce

  1. install a cluster of multiple IPA servers (with a DNS server on all nodes)
  2. load data
  3. reboot one node and have it jump the clock after reboot

Actual behavior

cn=meTovm-171-180.ipa.example,cn=replica,cn=dc\3Dipa\,dc\3Dexample,cn=mapping tree,cn=config contains
nsds5replicaLastUpdateStatusJSON: {"state": "red", "ldap_rc": "49", "ldap_rc_text": "Invalid credentials", "repl_rc": "16", "repl_rc_text": "connection error", "date": "2020-10-15T12:50:40Z", "message": "Error (49) Problem connecting to replica - LDAP error: Invalid credentials (connection error)"}

dirsrv error log:

[15/Oct/2020:14:50:05.849083300 +0200] - INFO - slapd_daemon - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[15/Oct/2020:14:50:05.879700312 +0200] - INFO - slapd_daemon - Listening on All Interfaces port 636 for LDAPS requests
[15/Oct/2020:14:50:05.925486952 +0200] - INFO - slapd_daemon - Listening on /var/run/slapd-ABC-EXAMPLE.socket for LDAPI requests
[15/Oct/2020:14:50:05.963754463 +0200] - ERR - schema-compat-plugin - schema-compat-plugin tree scan will start in about 5 seconds!
[15/Oct/2020:14:50:18.700810909 +0200] - ERR - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meTovm-171-180.ipa.example" (vm-171-180:389) - Replication bind with GSSAPI auth failed: LDAP error 49 (Invalid credentials) (SASL(-13): authentication failure: GSSAPI Failure: gss_accept_sec_context)
[15/Oct/2020:14:50:19.440476557 +0200] - ERR - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meTovm-171-245.ipa.example" (vm-171-245:389) - Replication bind with GSSAPI auth failed: LDAP error 49 (Invalid credentials) (SASL(-13): authentication failure: GSSAPI Failure: gss_accept_sec_context)
[15/Oct/2020:12:51:10.455341702 +0200] - ERR - schema-compat-plugin - warning: no entries set up under cn=computers, cn=compat,dc=abc,dc=example
[15/Oct/2020:12:51:10.458233497 +0200] - ERR - schema-compat-plugin - Finished plugin initialization.

krb5kdc.log

Oct 15 14:50:18 vm-178.ipa.example krb5kdc[1222](info): TGS_REQ (6 etypes {aes256-cts-hmac-sha1-96(18), aes256-cts-hmac-sha384-192(20), camellia256-cts-cmac(26), aes128-cts-hmac-sha1-96(17), aes128-cts-hmac-sha256-128(19), camellia128-cts-cmac(25)}) 10.37.170.178: ISSUE: authtime 1602766214, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, ldap/vm-178.ipa.example@IPA.EXAMPLE for ldap/vm-171-180.ipa.example@IPA.EXAMPLE
Oct 15 14:50:18 vm-178.ipa.example krb5kdc[1222](info): closing down fd 11
Oct 15 14:50:18 vm-178.ipa.example krb5kdc[1222](info): AS_REQ (6 etypes {aes256-cts-hmac-sha1-96(18), aes256-cts-hmac-sha384-192(20), camellia256-cts-cmac(26), aes128-cts-hmac-sha1-96(17), aes128-cts-hmac-sha256-128(19), camellia128-cts-cmac(25)}) 10.37.170.178: NEEDED_PREAUTH: ldap/vm-178.ipa.example@IPA.EXAMPLE for krbtgt/IPA.EXAMPLE@IPA.EXAMPLE, Additional pre-authentication required
Oct 15 14:50:18 vm-178.ipa.example krb5kdc[1222](info): closing down fd 11
Oct 15 14:50:18 vm-178.ipa.example krb5kdc[1222](info): AS_REQ (6 etypes {aes256-cts-hmac-sha1-96(18), aes256-cts-hmac-sha384-192(20), camellia256-cts-cmac(26), aes128-cts-hmac-sha1-96(17), aes128-cts-hmac-sha256-128(19), camellia128-cts-cmac(25)}) 10.37.170.178: ISSUE: authtime 1602766218, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, ldap/vm-178.ipa.example@IPA.EXAMPLE for krbtgt/IPA.EXAMPLE@IPA.EXAMPLE
Oct 15 14:50:18 vm-178.ipa.example krb5kdc[1222](info): closing down fd 11
Oct 15 14:50:18 vm-178.ipa.example krb5kdc[1222](info): TGS_REQ (6 etypes {aes256-cts-hmac-sha1-96(18), aes256-cts-hmac-sha384-192(20), camellia256-cts-cmac(26), aes128-cts-hmac-sha1-96(17), aes128-cts-hmac-sha256-128(19), camellia128-cts-cmac(25)}) 10.37.170.178: ISSUE: authtime 1602766218, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, ldap/vm-178.ipa.example@IPA.EXAMPLE for ldap/vm-171-245.ipa.example@IPA.EXAMPLE
Oct 15 14:50:18 vm-178.ipa.example krb5kdc[1222](info): closing down fd 11
Oct 15 14:50:22 vm-178.ipa.example krb5kdc[1224](info): AS_REQ (6 etypes {aes256-cts-hmac-sha1-96(18), aes256-cts-hmac-sha384-192(20), camellia256-cts-cmac(26), aes128-cts-hmac-sha1-96(17), aes128-cts-hmac-sha256-128(19), camellia128-cts-cmac(25)}) 10.37.170.178: NEEDED_PREAUTH: ldap/vm-178.ipa.example@IPA.EXAMPLE for krbtgt/IPA.EXAMPLE@IPA.EXAMPLE, Additional pre-authentication required
Oct 15 14:50:22 vm-178.ipa.example krb5kdc[1224](info): closing down fd 11
# systemctl status chronyd.service 
● chronyd.service - NTP client/server
     Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-10-15 14:49:19 CEST; 1h 27min left
...
Oct 15 14:49:18 localhost systemd[1]: Starting NTP client/server...
Oct 15 14:49:19 localhost chronyd[654]: chronyd version 3.5.1 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Oct 15 14:49:19 localhost chronyd[654]: Frequency 6.727 +/- 0.021 ppm read from /var/lib/chrony/drift
Oct 15 14:49:19 localhost chronyd[654]: Using right/UTC timezone to obtain leap second data
Oct 15 14:49:19 localhost systemd[1]: Started NTP client/server.
Oct 15 14:50:53 vm-178.ipa.example chronyd[654]: Selected source 74.6.168.72
Oct 15 14:50:53 vm-178.ipa.example chronyd[654]: System clock TAI offset set to 37 seconds
Oct 15 14:50:53 vm-178.ipa.example chronyd[654]: System clock wrong by -7198.506866 seconds, adjustment started
Oct 15 12:50:54 vm-178.ipa.example chronyd[654]: System clock was stepped by -7198.506866 seconds
# systemctl status krb5kdc.service 
● krb5kdc.service - Kerberos 5 KDC
     Loaded: loaded (/usr/lib/systemd/system/krb5kdc.service; disabled; vendor preset: disabled)
     Active: active (running) since Thu 2020-10-15 14:50:07 CEST; 1h 26min left
...
Oct 15 14:50:06 vm-178.ipa.example systemd[1]: Starting Kerberos 5 KDC...
Oct 15 14:50:07 vm-178.ipa.example systemd[1]: krb5kdc.service: Can't open PID file /run/krb5kdc.pid (yet?) after start: Operation not permitted
Oct 15 14:50:07 vm-178.ipa.example systemd[1]: Started Kerberos 5 KDC.

Expected behavior

Server should resume replication.

Version/Release/Distribution

freeipa-server-4.8.10-5.fc32.x86_64
freeipa-client-4.8.10-5.fc32.x86_64
package ipa-server is not installed
package ipa-client is not installed
389-ds-base-1.4.3.12-1.fc32.x86_64
pki-ca-10.9.4-1.fc32.noarch
krb5-server-1.18.2-22.fc32.x86_64

Additional info:

The IPACA replication entries have recovered automatically after 15 minutes:

cn=meTovm-171-180.ipa.example,cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config contains
nsds5replicaLastUpdateStatusJSON: {"state": "green", "ldap_rc": "0", "ldap_rc_text": "Success", "repl_rc": "0", "repl_rc_text": "replica acquired", "date": "2020-10-15T11:05:00Z", "message": "Error (0) Replica acquired successfully: Incremental update succeeded"}

The domain database replication entries are still failing after 30 minutes.

I think there are two bugs here:

  • 389-DS does not retry replication soon enough after an auth failure. It might also be a good idea to wipe the GSSAPI context and ccache after a GSSAPI authentication error.
  • ipa.service starts before chronyd.service has corrected the clock after a reboot. IPA should wait until a clock provider (chronyd.service, ntpd.service, systemd-timesyncd.service) has synchronized the clock.

Interesting. Sadly, there is no common event to bind to. The services you mention don't finish; they keep running, since they are daemons. We need some resource that they provide once they have successfully acquired network time.

A crude version would be to add a pre-exec (ExecStartPre=) script that polls timedatectl show:

until timedatectl show | grep -q '^NTPSynchronized=yes'; do sleep 2; done

But this would break on systems that don't use network time synchronization and might delay startup indefinitely...

There is a target, time-sync.target, which we can bind to:

After=time-sync.target
Requires=time-sync.target

The target ensures that chrony-wait.service, ntpdate.service, systemd-timesyncd.service, or any other time provider has reached sync state.
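
For illustration, the binding could be done with a drop-in rather than by patching the unit file. A sketch (the drop-in path and file name are my own, not anything IPA ships):

# /etc/systemd/system/ipa.service.d/wait-for-time-sync.conf
[Unit]
After=time-sync.target
Requires=time-sync.target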

It is inactive for me -- as are chrony-wait.service, ntpdate.service, and systemd-timesyncd.service.

chronyd.service is active and timedatectl shows me that NTP is synchronized, but the target is not active. So I guess this behavior is not uniform.

time-sync.target is a passive target: it only becomes active when another unit pulls it in. The chrony-wait.service unit is another problem; it is not enabled by default.
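
So on a chrony system, to make time-sync.target actually gate on synchronized time, the wait unit would have to be enabled explicitly, something like:

systemctl enable chrony-wait.service

If I read the chrony packaging correctly, chrony-wait.service orders itself Before=time-sync.target and blocks (via chronyc waitsync) until chronyd reports a synchronized clock.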

I just figured out another issue: we cannot use time-sync.target for ipa.service on IPA servers with the DNS service. chronyd needs working DNS to look up its upstream NTP servers, but ipa-dns-install configures /etc/resolv.conf to use 127.0.0.1, i.e. the local BIND DNS instance, as the DNS provider. IPA's BIND DNS server in turn depends on the 389-DS LDAP server for configuration data such as DNS forwarders. Chicken and egg...

Alexander, Simo, and I discussed the matter. We came to the conclusion that we have to change the start order of services: the BIND DNS instance should be started right after 389-DS and before the KDC. This would allow chronyd to sync time before the KDC is started.

bind-dyndb-ldap currently depends on GSSAPI authentication with a keytab. Since it already uses LDAPI, we could use EXTERNAL autobind with uid/gid instead. However, this would require extending 389-DS' autobind feature with some kind of referral/aliasing mechanism: since uid/gid values are installation-, OS-, and machine-specific, we want autobind to look up entries in cn=config and then have 389-DS authenticate the connection as the target DN given in authenticateAsDN (or whatever we want to call it):

dn: cn=bind-dyndb-ldap,cn=autobind,cn=config
cn: bind-dyndb-ldap
uidNumber: <uid-of-named>
gidNumber: <gid-of-named>
authenticateAsDN: krbprincipalname=DNS/$HOSTNAME@$REALM,cn=services,cn=accounts,$SUFFIX
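
With such an entry in place, a SASL EXTERNAL bind from named's uid/gid over LDAPI would end up authenticated as the DNS service principal's entry. A quick way to verify the mapping would be something like this (a sketch; run as the named user, with the LDAPI socket path taken from the dirsrv log above):

ldapwhoami -Y EXTERNAL -H ldapi://%2Fvar%2Frun%2Fslapd-ABC-EXAMPLE.socket

which should then report the krbprincipalname=DNS/... DN rather than a uid/gid-derived identity.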

I noticed that 389-DS resumes replication after a little more than 2 hours. That is about the same amount of time as the system clock was moved backwards when chronyd kicked in.

It seems to be a problem with NSPR's PR_WaitCondVar. On pthread platforms the function is built on top of pthread_cond_timedwait, which takes an absolute deadline. Both pthread_cond_timedwait and PR_WaitCondVar's internal helper pt_TimedWait default to the realtime clock, which is subject to clock changes. pthread_condattr_setclock with CLOCK_MONOTONIC could be used to switch the clock source from the realtime to the monotonic clock, but that fix would require changes, or even a new API, in NSPR.
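
For reference, this is what the monotonic-clock variant looks like at the raw pthread level. It is a standalone sketch of the technique, not NSPR code; the helper names are made up:

/* Condvar whose timeout is measured on CLOCK_MONOTONIC, so stepping the
 * realtime clock backwards by two hours cannot stretch the wait. */
#include <pthread.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond;

static void cond_init_monotonic(void)
{
    pthread_condattr_t attr;

    pthread_condattr_init(&attr);
    /* The default clock is CLOCK_REALTIME; select the monotonic clock. */
    pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    pthread_cond_init(&cond, &attr);
    pthread_condattr_destroy(&attr);
}

static int cond_wait_seconds(long secs)
{
    struct timespec deadline;
    int rc;

    /* The absolute deadline must be taken from the same clock that the
     * condvar was configured with. */
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += secs;

    pthread_mutex_lock(&lock);
    rc = pthread_cond_timedwait(&cond, &lock, &deadline);
    pthread_mutex_unlock(&lock);
    return rc; /* ETIMEDOUT after ~secs seconds, regardless of clock steps */
}

PR_WaitCondVar would have to grow a similar option internally, which is why this needs changes in NSPR itself.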

master:

  • 16e1cbd LDAP autobind authenticateAsDN for BIND named
  • 5b8f37f Add design for LDAPI autobind

Metadata Update from @abbra:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)
