#6587 ipa-otpd: systemctl reports "degraded" for "is-system-running" after todays CentOS updates
Closed: fixed a year ago by frenaud. Opened 5 years ago by jh23453.

Today I've received new ipa packages via CentOS.
After the installation IPA was restarted successfully, but
"systemctl is-system-running" reports:

 jochen@freeipa1:~$ systemctl is-system-running
 degraded

When I look for failed services I see ipa-otpd:

jochen@freeipa1:~$ systemctl list-units | grep failed
● ipa-otpd@0-2216-0.service   loaded failed failed    ipa-otpd service
jochen@freeipa1:~$ systemctl list-units | grep ipa-otpd
● ipa-otpd@0-2216-0.service   loaded failed failed    ipa-otpd service
  ipa-otpd@0-28372-0.service  loaded active running   ipa-otpd service (PID 28372/UID 0)
  ipa-otpd.socket             loaded active listening ipa-otpd socket

It would be nice to have no "failed" services.


could you paste output of systemctl status ipa-otpd@0-2216-0.service ?

Sorry, I don't have the failing status any more due to a reboot. But the messages from that time are:

Jan  3 08:31:29 freeipa1 yum[25930]: Updated: ipa-server-4.4.0-14.el7.centos.1.1.x86_64
Jan  3 08:31:30 freeipa1 yum[25930]: Updated: ipa-server-dns-4.4.0-14.el7.centos.1.1.noarch
Jan  3 08:31:31 freeipa1 yum[25930]: Updated: ipa-python-compat-4.4.0-14.el7.centos.1.1.noarch
Jan  3 08:31:48 freeipa1 systemd: Stopping 389 Directory Server JOCHEN-ORG....
Jan  3 08:31:48 freeipa1 ns-slapd: [03/Jan/2017:08:31:48.955310211 +0100] slapd shutting down - signaling operation threads - op stack size 28 max work 
q size 4 max work q stack size 4
Jan  3 08:31:49 freeipa1 ns-slapd: [03/Jan/2017:08:31:49.139476379 +0100] slapd shutting down - closing down internal subsystems and plugins
Jan  3 08:31:50 freeipa1 ns-slapd: [03/Jan/2017:08:31:50.964703556 +0100] Waiting for 4 database threads to stop
Jan  3 08:31:51 freeipa1 ns-slapd: [03/Jan/2017:08:31:51.545177886 +0100] All database threads now stopped
Jan  3 08:31:51 freeipa1 ns-slapd: [03/Jan/2017:08:31:51.634908943 +0100] slapd shutting down - freed 4 work q stack objects - freed 30 op stack objects
Jan  3 08:31:51 freeipa1 named-pkcs11[2230]: LDAP error: Can't contact LDAP server: ldap_sync_poll() failed
Jan  3 08:31:51 freeipa1 named-pkcs11[2230]: ldap_syncrepl will reconnect in 60 seconds
Jan  3 08:31:51 freeipa1 ipa-otpd: query.c:137: Input/output error: IO error received on query socket
Jan  3 08:31:51 freeipa1 systemd: ipa-otpd@0-2216-0.service: main process exited, code=exited, status=1/FAILURE
Jan  3 08:31:51 freeipa1 systemd: Unit ipa-otpd@0-2216-0.service entered failed state.
Jan  3 08:31:51 freeipa1 systemd: ipa-otpd@0-2216-0.service failed.
Jan  3 08:31:52 freeipa1 ns-slapd: [03/Jan/2017:08:31:52.363467272 +0100] slapd stopped.
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: ipa         : ERROR    syncrepl_poll: LDAP error ({'desc': "Can't contact LDAP server"})
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: Traceback (most recent call last):
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: File "/usr/libexec/ipa/ipa-dnskeysyncd", line 110, in <module>
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: while ldap_connection.syncrepl_poll(all=1, msgid=ldap_search):
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: File "/usr/lib64/python2.7/site-packages/ldap/syncrepl.py", line 348, in syncrepl_poll
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: add_intermediates=1, add_ctrls=1, all = 0
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: File "/usr/lib64/python2.7/site-packages/ldap/ldapobject.py", line 476, in result4
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: ldap_result = self._ldap_call(self._l.result4,msgid,all,timeout,add_ctrls,add_intermediates,add_extop)
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: File "/usr/lib64/python2.7/site-packages/ldap/ldapobject.py", line 99, in _ldap_call
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: result = func(*args,**kwargs)
Jan  3 08:31:52 freeipa1 ipa-dnskeysyncd: SERVER_DOWN: {'desc': "Can't contact LDAP server"}
Jan  3 08:31:52 freeipa1 systemd: Stopped 389 Directory Server JOCHEN-ORG..
Jan  3 08:31:53 freeipa1 systemd: Starting 389 Directory Server JOCHEN-ORG....

Hope that helps.

This is caused by invalid order of shutting down service. IPA should shutdown in reverse order than start.

Metadata Update from @jh23453:
- Issue assigned to someone
- Issue set to the milestone: Future Releases

5 years ago

It's unclear that there is anything to fix here, that it was a one-off. We have had no further reports of this type, closing as worksforme.

Metadata Update from @rcritten:
- Issue close_status updated to: worksforme
- Issue status updated to: Closed (was: Open)

3 years ago

I have observed this with ipa-server-4.8.0-13.module_el8.1.0+265+e1e65be4.x86_64. Have seen it also under Fedora. In journalctl there's:

May 27 10:08:38 skipper.[my domain] systemd[1]: Started ipa-otpd service (PID 3992/UID 0).
May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: LDAP: ldapi://%2Fvar%2Frun%2Fslapd-[my realm].socket
May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: request received
May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm] user query start
May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: user query end: uid=james,cn=users,cn=accounts,dc=cb,dc=[my dc]
May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: bind start: uid=james,cn=users,cn=accounts,dc=cb,dc=[my dc]
May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: bind end: success
May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: response sent: Access-Accept
May 27 11:09:34 skipper.[my domain] ipa-otpd[16008]: bind.c:088: Input/output error: IO error received on bind socket
May 27 11:09:34 skipper.[my domain] systemd[1]: ipa-otpd@1-3992-0.service: Main process exited, code=exited, status=1/FAILURE
May 27 11:09:34 skipper.[my domain] systemd[1]: ipa-otpd@1-3992-0.service: Failed with result 'exit-code'.

But OTP otherwise works. Seems to happen after short uptime; the account in question is configured with 2 tokens, one HOPT and the other TOTP. Will keep an eye on the exact conditions.

Same happening here, just appears every once in a while for no apparent reason.

Happening on ipa-server-4.8.4-7.module_el8.2.0+374+0d2d74a1.x86_64.

We have systemd monitoring in place, and at least every day we get one instantiated ipa-otpd service failing. Looking if we can switch to a full service and not a socket based activated one.

Metadata Update from @rcritten:
- Custom field affects_doc adjusted to on
- Custom field knownissue adjusted to on
- Issue status updated to: Open (was: Closed)

2 years ago

Additional logging could be useful, and any context on what was going on at the time.

we have a FreeRadius configuration that checks the OTP for our VPN solution agains FreeIPA. There is no more log than the one at https://pagure.io/freeipa/issue/6587#comment-654748

How can I enable additional debugging on ipa-otpd?

There is no additional debugging for ipa-otpd beyond the journal. But it talks to LDAP so the 389-ds access and error logs may be useful. And context can be important. Is the system under heavy load, were processes restarted, etc.

There is no additional debugging for ipa-otpd beyond the journal. But it talks to LDAP so the 389-ds access and error logs may be useful. And context can be important. Is the system under heavy load, were processes restarted, etc.

My system isn't under heavy load; the machine in question also doubles as a NFS server, DHCP, DNS, etc and spends most of its time idle. Not sure what triggers it yet. I've just reset-failed in systemd, tried with both TOTP and HOTP tokens entering both valid and incorrect codes -- no failure so far.

On logs: I don't know what specifically to look for, but cross-correlating the timestamps. ipa-optd log:

Jul 02 16:31:41 skipper.my.domain ipa-otpd[9289]: bind.c:088: Input/output error: IO error received on bind socket
Jul 02 16:31:41 skipper.my.domain systemd[1]: ipa-otpd@5-2512-0.service: Main process exited, code=exited, status=1/FAILURE
Jul 02 16:31:41 skipper.my.domain systemd[1]: ipa-otpd@5-2512-0.service: Failed with result 'exit-code'.

dirsrv access log:

[02/Jul/2020:16:31:41.002617642 +0100] conn=143 op=-1 fd=116 closed error 11 (Resource temporarily unavailable) - T1
[02/Jul/2020:16:31:41.002731317 +0100] conn=157 op=11 UNBIND
[02/Jul/2020:16:31:41.002771449 +0100] conn=157 op=11 fd=75 closed - U1
[02/Jul/2020:16:31:41.004876025 +0100] conn=142 op=3 UNBIND
[02/Jul/2020:16:31:41.004937185 +0100] conn=142 op=3 fd=115 closed - U1

Nothing in errors log.

Hello,

James pagure@pagure.io writes:

My system isn't under heavy load; the machine in question also doubles
as a NFS server, DHCP, DNS, etc and spends most of its time idle. Not
sure what triggers it yet. I've just reset-failed in systemd, tried
with both TOTP and HOTP tokens entering both valid and incorrect codes
-- no failure so far.

Same for me. My best guess is that IPA services have been restarted due
to packages upgrades. When looking at
https://pagure.io/freeipa/issue/6587#comment-315945 I see that dirsrv
has been stopped and ipa-otpd fails. Right now I've disabled OTP on my
users, so I can't easily reproduce - but I'll see what I can do.
Can you try the following?

  • Do an authentication with OTP. There should be an ipa-otpd service
    started an lingering around. Can we get a process list "ps aux | grep
    ipa-otp"? It should be a child process of PID 1 - I think.
  • Do "systemctl restart dirsrv@REALM.service"
  • Is "systemctrl is-system-running" degraded?

I had a look at the service file of ipa-otpd,
/usr/lib/systemd/system/ipa-otpd@.service contains:

,----
| [Unit]
| Description=ipa-otpd service
|
| [Service]
| EnvironmentFile=/etc/ipa/default.conf
| ExecStart=/usr/libexec/ipa/ipa-otpd $ldap_uri
| StandardInput=socket
| StandardOutput=socket
| StandardError=syslog
`----

Nothing with retry/restart (I'm unsure if that could possibly
help). There has been one commit to ipa-otpd after the inital commit:
https://github.com/freeipa/freeipa/commit/0756ce7d53133793b82674757e125524eede4721#diff-e225c5d5b5f1e58c50a232c93830b0ad

That commit contains this hunk:

,----
| otpd_log_err(EIO, "IO error received on query socket");
| verto_break(ctx.vctx);
| ctx.exitstatus = 1;
| return;
`----

What would happen if we exit with status 0? Then the unit shouldn't
fail. I'm not sure if that would be a correct solution...

There has been a commit to the sockets service file:
https://github.com/freeipa/freeipa/commit/d05d1115e409962fee3576a4bfc5cecfacef4fd3#diff-d69764b776aed9de30b04f89c35459d4
But I think that commit shouldn't change the problem at all.

More guesses that anything else, but I think we are getting closer.

Jochen

--
This space is intentionally left blank.

Nothing special for us. We have 3 FreeIPA servers, all of them mostly idle, they only run FreeIPA + FreeRadius.

Users come and go (~150 logins per day) through the VPN, and each VPN login checks the OTP token with a custom FreeRadius plugin that just does an ldap query on the ldap socket for a single object. Every few days (let's say every 1000 logins) we have one random ipa-otpd failure as above on one of the nodes.

We just had to reboot 2 of our 3 servers because otpd failed with this error and wouldn't restart. A fix would be really appreciated.

I got a list of failed services, and did systemctl status on them. They all failed exactly 1 hour after the last activity. I conjecture that slapd is doing an idle timeout. If that's so, it may be a mistake to exit claiming a failure.

Hi @jh23453 , we're also consistently experiencing this on our systems, and I can gladly debug for more info on this. The static unit stops with the same error message precisely one hour after the last communication.

Could you please elaborate on how to initiate an OTP authorization?
We don't use OTP as a feature so I'm not familiar with the internals and tools here.

Alright, I managed to spin it up.

Indeed, initiating an OTP login and manually restarting the dirsrv causes the same symptoms:

● ipa-otpd@61-848957-0.service - ipa-otpd service (PID 848957/UID 0)
   Loaded: loaded (/usr/lib/systemd/system/ipa-otpd@.service; static; vendor preset: disabled)
   Active: active (running) since Wed 2021-03-31 21:03:56 UTC; 8min ago
 Main PID: 2320872 (ipa-otpd)
    Tasks: 1 (limit: 24981)
   Memory: 2.2M
   CGroup: /system.slice/system-ipa\x2dotpd.slice/ipa-otpd@61-848957-0.service
           └─2320872 /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-AGDSN-DE.socket

Mar 31 21:03:56 myhost systemd[1]: Started ipa-otpd service (PID 848957/UID 0).
Mar 31 21:03:56 myhost ipa-otpd[2320872]: LDAP: ldapi://%2fvar%2frun%2fslapd-AGDSN-DE.socket
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: request received
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: user query start
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: user query end: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: bind start: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: bind end: success
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: response sent: Access-Accept
[root@idm0 2320872]# sd restart dirsrv@AGDSN-DE.service
[root@idm0 2320872]# sd status ipa-otpd@61-848957-0.service
● ipa-otpd@61-848957-0.service - ipa-otpd service (PID 848957/UID 0)
   Loaded: loaded (/usr/lib/systemd/system/ipa-otpd@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2021-03-31 21:12:31 UTC; 6s ago
  Process: 2320872 ExecStart=/usr/libexec/ipa/ipa-otpd $ldap_uri (code=exited, status=1/FAILURE)
 Main PID: 2320872 (code=exited, status=1/FAILURE)

Mar 31 21:03:56 myhost ipa-otpd[2320872]: LDAP: ldapi://%2fvar%2frun%2fslapd-AGDSN-DE.socket
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: request received
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: user query start
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: user query end: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: bind start: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: bind end: success
Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: response sent: Access-Accept
Mar 31 21:12:31 myhost ipa-otpd[2320872]:    query.c:137: Input/output error: IO error received on query socket
Mar 31 21:12:31 myhost systemd[1]: ipa-otpd@61-848957-0.service: Main process exited, code=exited, status=1/FAILURE
Mar 31 21:12:31 myhost systemd[1]: ipa-otpd@61-848957-0.service: Failed with result 'exit-code'.

Yep! It's the idle timeout!
indeed:

[root@idm0 slapd-AGDSN-DE]# ldapsearch -LLo ldif-wrap=no -D "cn=directory manager" -W -b 'cn=config' -s base nsslapd-idletimeout
Enter LDAP Password: 
version: 1

dn: cn=config
nsslapd-idletimeout: 3600

…and after
- resetting all failed units
- setting nsslapd-idletimeout to 60
- restarting dirsrv@blah
- re-initiating the OTP auth
- waiting 60s

I get the same log with a 60s interval in the relevant lines:

[root@idm0 slapd-AGDSN-DE]# sd status ipa-otpd@62-848960-0.service
● ipa-otpd@62-848960-0.service - ipa-otpd service (PID 848960/UID 0)
   Loaded: loaded (/usr/lib/systemd/system/ipa-otpd@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2021-03-31 21:29:58 UTC; 1min 1s ago
  Process: 2322932 ExecStart=/usr/libexec/ipa/ipa-otpd $ldap_uri (code=exited, status=1/FAILURE)
 Main PID: 2322932 (code=exited, status=1/FAILURE)

Mar 31 21:28:58 myhost ipa-otpd[2322932]: LDAP: ldapi://%2fvar%2frun%2fslapd-AGDSN-DE.socket
Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: request received
Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: user query start
Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: user query end: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de
Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: bind start: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de
Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: bind end: success
Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: response sent: Access-Accept
Mar 31 21:29:58 myhost ipa-otpd[2322932]:     bind.c:088: Input/output error: IO error received on bind socket
Mar 31 21:29:58 myhost systemd[1]: ipa-otpd@62-848960-0.service: Main process exited, code=exited, status=1/FAILURE
Mar 31 21:29:58 myhost systemd[1]: ipa-otpd@62-848960-0.service: Failed with result 'exit-code'.

So… is the solution here to exit 0? Or am I missing context as to why that might not be desirable

master:

  • 1cc2671 ipa-otpd: handle LDAP timeout in a better way

ipa-4-6:

  • 2b414db ipa-otpd: handle LDAP timeout in a better way

ipa-4-8:

  • 5d4d490 ipa-otpd: handle LDAP timeout in a better way

ipa-4-9:

  • 33404a6 ipa-otpd: handle LDAP timeout in a better way

Metadata Update from @frenaud:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

a year ago

master:

  • 81b6157 ipatests: Test for OTP when the LDAP connection timed out.

ipa-4-9:

  • 25a4acf ipatests: Test for OTP when the LDAP connection timed out.

ipa-4-8:

  • b3860cb ipatests: Test for OTP when the LDAP connection timed out.

ipa-4-6:

  • 5b1b5ae ipatests: Test for OTP when the LDAP connection timed out.

master:

  • 16b3788 ipatests: Refactor test_check_otpd_after_idle_timeout

ipa-4-6:

  • 3b79b27 ipatests: Refactor test_check_otpd_after_idle_timeout

ipa-4-9:

  • eac03d6 ipatests: Refactor test_check_otpd_after_idle_timeout

ipa-4-8:

  • 827ff65 ipatests: Refactor test_check_otpd_after_idle_timeout

Login to comment on this ticket.

Metadata