Today I've received new ipa packages via CentOS. After the installation IPA was restarted successfully, but "systemctl is-system-running" reports:
jochen@freeipa1:~$ systemctl is-system-running degraded
When I look for failed services I see ipa-otpd:
jochen@freeipa1:~$ systemctl list-units | grep failed ● ipa-otpd@0-2216-0.service loaded failed failed ipa-otpd service jochen@freeipa1:~$ systemctl list-units | grep ipa-otpd ● ipa-otpd@0-2216-0.service loaded failed failed ipa-otpd service ipa-otpd@0-28372-0.service loaded active running ipa-otpd service (PID 28372/UID 0) ipa-otpd.socket loaded active listening ipa-otpd socket
It would be nice to have no "failed" services.
could you paste output of systemctl status ipa-otpd@0-2216-0.service ?
systemctl status ipa-otpd@0-2216-0.service
Sorry, I don't have the failing status any more due to a reboot. But the messages from that time are:
Jan 3 08:31:29 freeipa1 yum[25930]: Updated: ipa-server-4.4.0-14.el7.centos.1.1.x86_64 Jan 3 08:31:30 freeipa1 yum[25930]: Updated: ipa-server-dns-4.4.0-14.el7.centos.1.1.noarch Jan 3 08:31:31 freeipa1 yum[25930]: Updated: ipa-python-compat-4.4.0-14.el7.centos.1.1.noarch Jan 3 08:31:48 freeipa1 systemd: Stopping 389 Directory Server JOCHEN-ORG.... Jan 3 08:31:48 freeipa1 ns-slapd: [03/Jan/2017:08:31:48.955310211 +0100] slapd shutting down - signaling operation threads - op stack size 28 max work q size 4 max work q stack size 4 Jan 3 08:31:49 freeipa1 ns-slapd: [03/Jan/2017:08:31:49.139476379 +0100] slapd shutting down - closing down internal subsystems and plugins Jan 3 08:31:50 freeipa1 ns-slapd: [03/Jan/2017:08:31:50.964703556 +0100] Waiting for 4 database threads to stop Jan 3 08:31:51 freeipa1 ns-slapd: [03/Jan/2017:08:31:51.545177886 +0100] All database threads now stopped Jan 3 08:31:51 freeipa1 ns-slapd: [03/Jan/2017:08:31:51.634908943 +0100] slapd shutting down - freed 4 work q stack objects - freed 30 op stack objects Jan 3 08:31:51 freeipa1 named-pkcs11[2230]: LDAP error: Can't contact LDAP server: ldap_sync_poll() failed Jan 3 08:31:51 freeipa1 named-pkcs11[2230]: ldap_syncrepl will reconnect in 60 seconds Jan 3 08:31:51 freeipa1 ipa-otpd: query.c:137: Input/output error: IO error received on query socket Jan 3 08:31:51 freeipa1 systemd: ipa-otpd@0-2216-0.service: main process exited, code=exited, status=1/FAILURE Jan 3 08:31:51 freeipa1 systemd: Unit ipa-otpd@0-2216-0.service entered failed state. Jan 3 08:31:51 freeipa1 systemd: ipa-otpd@0-2216-0.service failed. Jan 3 08:31:52 freeipa1 ns-slapd: [03/Jan/2017:08:31:52.363467272 +0100] slapd stopped. Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: ipa : ERROR syncrepl_poll: LDAP error ({'desc': "Can't contact LDAP server"}) Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: Traceback (most recent call last): Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: File "/usr/libexec/ipa/ipa-dnskeysyncd", line 110, in <module> Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: while ldap_connection.syncrepl_poll(all=1, msgid=ldap_search): Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: File "/usr/lib64/python2.7/site-packages/ldap/syncrepl.py", line 348, in syncrepl_poll Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: add_intermediates=1, add_ctrls=1, all = 0 Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: File "/usr/lib64/python2.7/site-packages/ldap/ldapobject.py", line 476, in result4 Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: ldap_result = self._ldap_call(self._l.result4,msgid,all,timeout,add_ctrls,add_intermediates,add_extop) Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: File "/usr/lib64/python2.7/site-packages/ldap/ldapobject.py", line 99, in _ldap_call Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: result = func(*args,**kwargs) Jan 3 08:31:52 freeipa1 ipa-dnskeysyncd: SERVER_DOWN: {'desc': "Can't contact LDAP server"} Jan 3 08:31:52 freeipa1 systemd: Stopped 389 Directory Server JOCHEN-ORG.. Jan 3 08:31:53 freeipa1 systemd: Starting 389 Directory Server JOCHEN-ORG....
Hope that helps.
This is caused by invalid order of shutting down service. IPA should shutdown in reverse order than start.
Metadata Update from @jh23453: - Issue assigned to someone - Issue set to the milestone: Future Releases
It's unclear that there is anything to fix here, that it was a one-off. We have had no further reports of this type, closing as worksforme.
Metadata Update from @rcritten: - Issue close_status updated to: worksforme - Issue status updated to: Closed (was: Open)
I have observed this with ipa-server-4.8.0-13.module_el8.1.0+265+e1e65be4.x86_64. Have seen it also under Fedora. In journalctl there's:
May 27 10:08:38 skipper.[my domain] systemd[1]: Started ipa-otpd service (PID 3992/UID 0). May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: LDAP: ldapi://%2Fvar%2Frun%2Fslapd-[my realm].socket May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: request received May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm] user query start May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: user query end: uid=james,cn=users,cn=accounts,dc=cb,dc=[my dc] May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: bind start: uid=james,cn=users,cn=accounts,dc=cb,dc=[my dc] May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: bind end: success May 27 10:08:38 skipper.[my domain] ipa-otpd[16008]: james@[my realm]: response sent: Access-Accept May 27 11:09:34 skipper.[my domain] ipa-otpd[16008]: bind.c:088: Input/output error: IO error received on bind socket May 27 11:09:34 skipper.[my domain] systemd[1]: ipa-otpd@1-3992-0.service: Main process exited, code=exited, status=1/FAILURE May 27 11:09:34 skipper.[my domain] systemd[1]: ipa-otpd@1-3992-0.service: Failed with result 'exit-code'.
But OTP otherwise works. Seems to happen after short uptime; the account in question is configured with 2 tokens, one HOPT and the other TOTP. Will keep an eye on the exact conditions.
Same happening here, just appears every once in a while for no apparent reason.
Happening on ipa-server-4.8.4-7.module_el8.2.0+374+0d2d74a1.x86_64.
We have systemd monitoring in place, and at least every day we get one instantiated ipa-otpd service failing. Looking if we can switch to a full service and not a socket based activated one.
Metadata Update from @rcritten: - Custom field affects_doc adjusted to on - Custom field knownissue adjusted to on - Issue status updated to: Open (was: Closed)
Additional logging could be useful, and any context on what was going on at the time.
we have a FreeRadius configuration that checks the OTP for our VPN solution agains FreeIPA. There is no more log than the one at https://pagure.io/freeipa/issue/6587#comment-654748
How can I enable additional debugging on ipa-otpd?
ipa-otpd
There is no additional debugging for ipa-otpd beyond the journal. But it talks to LDAP so the 389-ds access and error logs may be useful. And context can be important. Is the system under heavy load, were processes restarted, etc.
My system isn't under heavy load; the machine in question also doubles as a NFS server, DHCP, DNS, etc and spends most of its time idle. Not sure what triggers it yet. I've just reset-failed in systemd, tried with both TOTP and HOTP tokens entering both valid and incorrect codes -- no failure so far.
On logs: I don't know what specifically to look for, but cross-correlating the timestamps. ipa-optd log:
Jul 02 16:31:41 skipper.my.domain ipa-otpd[9289]: bind.c:088: Input/output error: IO error received on bind socket Jul 02 16:31:41 skipper.my.domain systemd[1]: ipa-otpd@5-2512-0.service: Main process exited, code=exited, status=1/FAILURE Jul 02 16:31:41 skipper.my.domain systemd[1]: ipa-otpd@5-2512-0.service: Failed with result 'exit-code'.
dirsrv access log:
[02/Jul/2020:16:31:41.002617642 +0100] conn=143 op=-1 fd=116 closed error 11 (Resource temporarily unavailable) - T1 [02/Jul/2020:16:31:41.002731317 +0100] conn=157 op=11 UNBIND [02/Jul/2020:16:31:41.002771449 +0100] conn=157 op=11 fd=75 closed - U1 [02/Jul/2020:16:31:41.004876025 +0100] conn=142 op=3 UNBIND [02/Jul/2020:16:31:41.004937185 +0100] conn=142 op=3 fd=115 closed - U1
Nothing in errors log.
Hello,
James pagure@pagure.io writes:
Same for me. My best guess is that IPA services have been restarted due to packages upgrades. When looking at https://pagure.io/freeipa/issue/6587#comment-315945 I see that dirsrv has been stopped and ipa-otpd fails. Right now I've disabled OTP on my users, so I can't easily reproduce - but I'll see what I can do. Can you try the following?
I had a look at the service file of ipa-otpd, /usr/lib/systemd/system/ipa-otpd@.service contains:
,---- | [Unit] | Description=ipa-otpd service | | [Service] | EnvironmentFile=/etc/ipa/default.conf | ExecStart=/usr/libexec/ipa/ipa-otpd $ldap_uri | StandardInput=socket | StandardOutput=socket | StandardError=syslog `----
Nothing with retry/restart (I'm unsure if that could possibly help). There has been one commit to ipa-otpd after the inital commit: https://github.com/freeipa/freeipa/commit/0756ce7d53133793b82674757e125524eede4721#diff-e225c5d5b5f1e58c50a232c93830b0ad
That commit contains this hunk:
,---- | otpd_log_err(EIO, "IO error received on query socket"); | verto_break(ctx.vctx); | ctx.exitstatus = 1; | return; `----
What would happen if we exit with status 0? Then the unit shouldn't fail. I'm not sure if that would be a correct solution...
There has been a commit to the sockets service file: https://github.com/freeipa/freeipa/commit/d05d1115e409962fee3576a4bfc5cecfacef4fd3#diff-d69764b776aed9de30b04f89c35459d4 But I think that commit shouldn't change the problem at all.
More guesses that anything else, but I think we are getting closer.
Jochen
-- This space is intentionally left blank.
Nothing special for us. We have 3 FreeIPA servers, all of them mostly idle, they only run FreeIPA + FreeRadius.
Users come and go (~150 logins per day) through the VPN, and each VPN login checks the OTP token with a custom FreeRadius plugin that just does an ldap query on the ldap socket for a single object. Every few days (let's say every 1000 logins) we have one random ipa-otpd failure as above on one of the nodes.
We just had to reboot 2 of our 3 servers because otpd failed with this error and wouldn't restart. A fix would be really appreciated.
I got a list of failed services, and did systemctl status on them. They all failed exactly 1 hour after the last activity. I conjecture that slapd is doing an idle timeout. If that's so, it may be a mistake to exit claiming a failure.
Hi @jh23453 , we're also consistently experiencing this on our systems, and I can gladly debug for more info on this. The static unit stops with the same error message precisely one hour after the last communication.
Could you please elaborate on how to initiate an OTP authorization? We don't use OTP as a feature so I'm not familiar with the internals and tools here.
Alright, I managed to spin it up.
Indeed, initiating an OTP login and manually restarting the dirsrv causes the same symptoms:
dirsrv
● ipa-otpd@61-848957-0.service - ipa-otpd service (PID 848957/UID 0) Loaded: loaded (/usr/lib/systemd/system/ipa-otpd@.service; static; vendor preset: disabled) Active: active (running) since Wed 2021-03-31 21:03:56 UTC; 8min ago Main PID: 2320872 (ipa-otpd) Tasks: 1 (limit: 24981) Memory: 2.2M CGroup: /system.slice/system-ipa\x2dotpd.slice/ipa-otpd@61-848957-0.service └─2320872 /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-AGDSN-DE.socket Mar 31 21:03:56 myhost systemd[1]: Started ipa-otpd service (PID 848957/UID 0). Mar 31 21:03:56 myhost ipa-otpd[2320872]: LDAP: ldapi://%2fvar%2frun%2fslapd-AGDSN-DE.socket Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: request received Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: user query start Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: user query end: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: bind start: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: bind end: success Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: response sent: Access-Accept [root@idm0 2320872]# sd restart dirsrv@AGDSN-DE.service [root@idm0 2320872]# sd status ipa-otpd@61-848957-0.service ● ipa-otpd@61-848957-0.service - ipa-otpd service (PID 848957/UID 0) Loaded: loaded (/usr/lib/systemd/system/ipa-otpd@.service; static; vendor preset: disabled) Active: failed (Result: exit-code) since Wed 2021-03-31 21:12:31 UTC; 6s ago Process: 2320872 ExecStart=/usr/libexec/ipa/ipa-otpd $ldap_uri (code=exited, status=1/FAILURE) Main PID: 2320872 (code=exited, status=1/FAILURE) Mar 31 21:03:56 myhost ipa-otpd[2320872]: LDAP: ldapi://%2fvar%2frun%2fslapd-AGDSN-DE.socket Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: request received Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: user query start Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: user query end: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: bind start: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: bind end: success Mar 31 21:03:56 myhost ipa-otpd[2320872]: lukasjuhrich@AGDSN.DE: response sent: Access-Accept Mar 31 21:12:31 myhost ipa-otpd[2320872]: query.c:137: Input/output error: IO error received on query socket Mar 31 21:12:31 myhost systemd[1]: ipa-otpd@61-848957-0.service: Main process exited, code=exited, status=1/FAILURE Mar 31 21:12:31 myhost systemd[1]: ipa-otpd@61-848957-0.service: Failed with result 'exit-code'.
Yep! It's the idle timeout! indeed:
[root@idm0 slapd-AGDSN-DE]# ldapsearch -LLo ldif-wrap=no -D "cn=directory manager" -W -b 'cn=config' -s base nsslapd-idletimeout Enter LDAP Password: version: 1 dn: cn=config nsslapd-idletimeout: 3600
…and after - resetting all failed units - setting nsslapd-idletimeout to 60 - restarting dirsrv@blah - re-initiating the OTP auth - waiting 60s
nsslapd-idletimeout
60
dirsrv@blah
I get the same log with a 60s interval in the relevant lines:
[root@idm0 slapd-AGDSN-DE]# sd status ipa-otpd@62-848960-0.service ● ipa-otpd@62-848960-0.service - ipa-otpd service (PID 848960/UID 0) Loaded: loaded (/usr/lib/systemd/system/ipa-otpd@.service; static; vendor preset: disabled) Active: failed (Result: exit-code) since Wed 2021-03-31 21:29:58 UTC; 1min 1s ago Process: 2322932 ExecStart=/usr/libexec/ipa/ipa-otpd $ldap_uri (code=exited, status=1/FAILURE) Main PID: 2322932 (code=exited, status=1/FAILURE) Mar 31 21:28:58 myhost ipa-otpd[2322932]: LDAP: ldapi://%2fvar%2frun%2fslapd-AGDSN-DE.socket Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: request received Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: user query start Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: user query end: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: bind start: uid=lukasjuhrich,cn=users,cn=accounts,dc=agdsn,dc=de Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: bind end: success Mar 31 21:28:58 myhost ipa-otpd[2322932]: lukasjuhrich@AGDSN.DE: response sent: Access-Accept Mar 31 21:29:58 myhost ipa-otpd[2322932]: bind.c:088: Input/output error: IO error received on bind socket Mar 31 21:29:58 myhost systemd[1]: ipa-otpd@62-848960-0.service: Main process exited, code=exited, status=1/FAILURE Mar 31 21:29:58 myhost systemd[1]: ipa-otpd@62-848960-0.service: Failed with result 'exit-code'.
So… is the solution here to exit 0? Or am I missing context as to why that might not be desirable
exit 0
master:
ipa-4-6:
ipa-4-8:
ipa-4-9:
Metadata Update from @frenaud: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
Login to comment on this ticket.