#1655 Login fails - sssd_be module polling fd indefinitely and gets killed
Closed: Fixed None Opened 7 years ago by endzone.

Info

We are running SSSD 1.8.3 on openSUSE 12.2 using KRB5 and LDAP. The LDAP access is also kerberized, i.e., SSSD uses GSSAPI to authenticate to the LDAP server (see sssd.conf attached).

Problem description

After returning from a lunch break, a user was not able to unlock the KDE screensaver at around 13:02. Entering user (iscan) password and confirming simply did not do anything, i.e. he was prompted again by KDE (multiple trials between 13:06 - 13:08).

Some diagnosis

1) PID and starting time of sssd_be

> ps auxf | grep sss:

root       942  0.0  0.0  78372  1624 ?        Ss   Nov15   0:13 /usr/sbin/sssd -D -f
root       945  0.0  0.0  66876  2192 ?        S    Nov15   0:06  \_ /usr/lib/sssd/sssd_nss --debug-to-files
root       946  0.0  0.0  66700  2036 ?        S    Nov15   0:05  \_ /usr/lib/sssd/sssd_pam --debug-to-files
root     28586  0.0  0.2 149332  8528 ?        S    13:03   0:00  \_ /usr/lib/sssd/sssd_be --domain default --debug-to-files

The PID and starting time of sssd_be suggest, that it was restarted just after the user tried to login. Possibly after a crash? All the other sssh binaried started days before on system boot. Looking at

grep sssd /var/log/messages


...
Nov 19 13:03:35 st-iscan sssd[be[default]]: Starting up
...
Nov 19 13:06:53 st-iscan kcheckpass[28512]: pam_sss(xdm:auth): Request to sssd failed. Timer expired
Nov 19 13:06:53 st-iscan kcheckpass[28512]: Authentication failure for iscan (invoked by uid 10052)
...
Nov 19 13:07:28 st-iscan kcheckpass[28520]: pam_sss(xdm:auth): Request to sssd failed. Timer expired
Nov 19 13:07:28 st-iscan kcheckpass[28520]: Authentication failure for iscan (invoked by uid 10052)
...
Nov 19 13:07:49 st-iscan kcheckpass[28521]: pam_sss(xdm:auth): Request to sssd failed. Timer expired
Nov 19 13:07:49 st-iscan kcheckpass[28521]: Authentication failure for iscan (invoked by uid 10052)
...
Nov 19 13:08:09 st-iscan kcheckpass[28582]: pam_sss(xdm:auth): Request to sssd failed. Timer expired
Nov 19 13:08:09 st-iscan kcheckpass[28582]: Authentication failure for iscan (invoked by uid 10052)
...
Nov 19 13:08:32 st-iscan kcheckpass[28585]: pam_sss(xdm:auth): Request to sssd failed. Timer expired
Nov 19 13:08:32 st-iscan kcheckpass[28585]: Authentication failure for iscan (invoked by uid 10052)

as well as

grep default /var/log/sssd/sssd.log


(Mon Nov 19 13:03:35 2012) [sssd] [mt_svc_sigkill] (0x0010): [default][943] is not responding to SIGTERM. Sending SIGKILL.

seems to confirm this.

SSSD Logfiles

  1. /var/log/sssd/sssd.log, see above\
  2. /var/log/sssd/krb5_child.log, nothing really around the mentioned time\
  3. /var/log/sssd/ldap_child.log looks normal, child acquires Kerberos principal from keytab:\
  4. /var/log/sssd/sssd_default.log is interesting.

Observations from sssd_deafult.log

  1. There is sudden break after "Executing sasl bind mech: GSSAPI, user: (null)" at 13:01:53 suggest that this module crashed and then is restartet ad 13:03.
  2. Still, login is not possible after this, as all the KDE login trials between 13:06 and 13:08 with user iscan don't even show up in here.

Summary

Login at 13:01:53 fails as sssd_default seems to crash when performing LDAP search. Module sssd_default restarts, but login still not possible, maybe due to broken connection to sssd/ldap/krb module after restart (Timer expired error around 13:06-13:08)?

Solution

Restart SSSD service, KDE login then succeeds.


Remark: by module "sssd_default" I actually mean "sssd_be"...

The problem just happened again and I had an strace running on "sssd_be". As I am not sure, if the strace contains sensitive material like passwords or Kerberos tokens, I do not want to post it here entirely, but just a snippet.

The IP 192.168.109.3 is our Kerberos KDC, last logfile entry in "sssd_default.log" is at 15:32:30, saying "Executing sasl bind mech: GSSAPI, user: (null)", then, a little while later, "sssd_be" gets killed by "sssd" as it is not responding.

From the strace, one can see that "sssd_be" gets caught in a loop somehow, polling fd 23 indefinitely:

... Last logfile entry:
30668 write(13, "(Mon Nov 19 15:32:30 2012) [sssd[be[default]]] [sdap_get_tgt_recv] (0x0400): ", 77) = 77
30668 write(13, "Child responded: 0 [FILE:/var/lib/sss/db/ccache_LNT.EI.TUM.DE], expired on [1353421950]\n", 88) = 88
30668 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
30668 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
30668 write(13, "(Mon Nov 19 15:32:30 2012) [sssd[be[default]]] [sasl_bind_send] (0x0100): ", 74) = 74
30668 write(13, "Executing sasl bind mech: GSSAPI, user: (null)\n", 47) = 47
30668 uname({sys="Linux", node="st-iscan", ...}) = 0
... lots of Kerberos processing stripped ...
30668 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 23
30668 fcntl(23, F_SETFD, FD_CLOEXEC)    = 0
30668 ioctl(23, FIONBIO, [1])           = 0
30668 connect(23, {sa_family=AF_INET, sin_port=htons(88), sin_addr=inet_addr("192.168.109.3")}, 16) = 0
30668 sendto(23, "<KRB DATA STRIPPED>"..., 696, 0, NULL, 0) = 696
30668 poll([{fd=23, events=POLLIN}], 1, 1000) = 0 (Timeout)
30668 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 24
30668 fcntl(24, F_SETFD, FD_CLOEXEC)    = 0
30668 ioctl(24, FIONBIO, [1])           = 0
30668 setsockopt(24, SOL_SOCKET, SO_LINGER, {onoff=0, linger=0}, 8) = 0
30668 connect(24, {sa_family=AF_INET, sin_port=htons(88), sin_addr=inet_addr("192.168.109.3")}, 16) = -1 EINPROGRESS (Operation now in progress)
30668 poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN|POLLOUT}], 2, 1000) = 1 ([{fd=24, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
30668 close(24)                         = 0
30668 close(-1)                         = -1 EBADF (Bad file descriptor)
30668 poll([{fd=23, events=POLLIN}], 1, 1000) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 74) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 73) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 73) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 73) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 73) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, 73) = 1 ([{fd=23, revents=POLLIN}])
... counting down from 73 to negative numbers until ...
30668 poll([{fd=23, events=POLLIN}], 1, -107791) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107791) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107791) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107791) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107791) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792) = 1 ([{fd=23, revents=POLLIN}])
30668 poll([{fd=23, events=POLLIN}], 1, -107792 <ptrace(SYSCALL):No such process>
30668 +++ killed by SIGKILL +++

summary: Login fails after some idle time - default module crashed? => Login fails - sssd_be module polling fd indefinitely and gets killed

So there seems to be two problems here:\
1) sssd_be gets caught in a loop and is not responding while talking to the KDC,\
2) sssd_be gets killed and restarts ok, but login then is still not possible from KDE (with Timer expired error). Note that a SSH login (with a different username) is working though.

Thank you for the bug report. We've seen a report like this in the past but we haven't been able to gather enough debugging information. Your report is definitely useful.

Problem 2) is far more concerning - if we can't recover from a crash, then we might leave the user locked out (which is basically what happened to you).

Is the bug reproducable in any way?

Well, no, not exactly so far. I tried to simulate a hang of sssd_be by issuing a "pkill --signal STOP sssd_be". Do so, the pings to sssd_be fail and it gets killed and restarted after about 5+ ping trials.

If I try to login to the box while sssd_be is in STOP state, nothing happens (obviously). But also, I cannot login, once sssd_be restarted. I was not able to reproduce this 100%, I think the key part is to do a login trial while sssd_be is not responding.

This situation comes very close to the original problem, but I do not get the Timer expired error message in /var/log/messages. Hence, it's not 100% the identical to the problem #2 described above.

Steps:
1. pkill --signal STOP sssd_be
2. Try to log in, nothing happens (expected). In sssd_pam.log, you can see an authentication request getting sent.
3. Wait for sssd_be getting killed and restarted.
4. Try to log in again, still not working. In sssd_pam.log, it says that 1) it successfully reconnected to the data provider, but 2) an identical authentication request already exists.
5. I waited 2-3 more minutes and the restarted sssd to make login work again.

Just to mention another problem that is caused most likely by this bug :-)

Once sssd_be stops working correctly, bash-completion for files and directories (even as root user) hangs! That is, pressing the TAB key causes bash to hang.

The reason seems to be, that bash queries NSS (libnss_sss.so ) about uid/gid information, but does not get response.

Bug appeared again this Monday morning (after machine idling over the weekend).

No new insight on problem 1) why sssd_be is crashing exactly, but the issue with the bug only appearing after some idle time suggests that it could be Kerberos related (i.e. maybe TGT running out and crashing while getting it again)?

About problem 2) I can confirm that user is locked out! The actual bug looks mostly identical to the simulated case with "pkill --signal STOP sssd_be" from above, so that might be a good starting point for debugging. I can see that sssd_be restarted after crash, but no login possible.

I did some strace on the kscreenlocker and agetty, which both are unable to login the user. Both send a request to sssd_pam, which logs "Identical request in progress" and then does nothing. Login is working again after a kill/restart sssd_pam!

Link to the right BZ when identified.

milestone: NEEDS_TRIAGE => SSSD 1.9.4

Is there anything further I can do to help?
Currently I am unable to further diagnose the reason for the crash (problem 1), but the user lockout (problem 2) seems to be reproducible by the steps in comment 5.

Possibly, problem 1 and 2 could be split into two tickets to investigate them independently...?

Sorry for the delay, we've just released 1.9.3 yesterday. This bug is one of the highest priorities in the 1.9.4 cycle, so we'll be looking into it in the next couple of days.

May be same as #1564

rhbz: => 0

I think I found the reason for problem 1), the indefinite polling shown in the strace.

It is most likely bug in krb5-10.x, causing an indefinite loop during KDC communications: https://github.com/krb5/krb5/commit/2b06a22f7fd8ec01fb27a7335125290b8ceb6f18

I compiled a test package with this commit for openSUSE12.2, started testing and will report back. I am confident that problem 1 will be solved by this.

Problem 2) then still open!

I tried to reproduce by using SIGSTOP, but as soon as sssd_be is killed I can login again.
I tried using ssh, will try some other programs, I do not have the KDS screensaver handy though

Ok I reproduced with the gnome screensaver.
I know exactly what is going on now, working on a fix.

Fields changed

owner: somebody => simo
patch: 0 => 1

resolution: => fixed
status: new => closed

Metadata Update from @endzone:
- Issue assigned to simo
- Issue set to the milestone: SSSD 1.9.4

2 years ago

Login to comment on this ticket.

Metadata