Issue #827: Something breaks mysteriously after a couple of weeks of running - sssd

SSSD / sssd

#827 Something breaks mysteriously after a couple of weeks of running

Closed: Fixed None Opened 13 years ago by twhitehead.

We have an LDAP based setup using sssd for nss and pam on Fedora 14. After an apparently random number of days, sssd just stops returning entries until it is restarted.

For example, currently on one of our machines, I get

$ getent passwd tyson
$

while on another identically configured machine, I get

$ getent passwd tyson
tyson:*:130018:130018:Tyson Whitehead:/home/tyson:/bin/bash
$

This is the third our fourth time in last two or three months that this has happened, so I've turned the verbosity all the way up in logs. The first machines's sssd_nss.log says

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [accept_fd_handler] (4): Client connected!
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [sss_cmd_get_version] (5): Received client version [1].
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [sss_cmd_get_version] (5): Offered version [1].
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [nss_cmd_getpwnam] (4): Requesting info for [tyson] from [<ALL>]
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [tyson@LDAP]
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x110b2a0

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x110b390

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x110b390 "ltdb_timeout"

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x110b2a0 "ltdb_callback"

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [sss_dp_send_acct_req_create] (4): Sending request for [LDAP][4097][1][name=tyson]
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [sbus_add_timeout] (8): 0x10daf40
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [sbus_remove_timeout] (8): 0x10daf40
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 10C15C0
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching.
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [sss_dp_get_reply] (4): Got reply (0, 0, Success) from Data Provider
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [tyson@LDAP]
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x110b7f0

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x110b8a0

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x110b8a0 "ltdb_timeout"

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x110b7f0 "ltdb_callback"

(Mon Mar 14 14:26:25 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (2): No matching domain found for [tyson], fail!
(Mon Mar 14 14:26:26 2011) [sssd[nss]] [client_recv] (5): Client disconnected!

The second machines sssd_nss.log says

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [accept_fd_handler] (4): Client connected!
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [sss_cmd_get_version] (5): Received client version [1].
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [sss_cmd_get_version] (5): Offered version [1].
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [nss_cmd_getpwnam] (4): Requesting info for [tyson] from [<ALL>]
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [tyson@LDAP]
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x22ee620

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x22ee740

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x22ee740 "ltdb_timeout"

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x22ee620 "ltdb_callback"

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [sss_dp_send_acct_req_create] (4): Sending request for [LDAP][4097][1][name=tyson]
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [sbus_add_timeout] (8): 0x22eda40
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [sbus_remove_timeout] (8): 0x22eda40
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 22C3F40
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching.
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [sss_dp_get_reply] (4): Got reply (0, 0, Success) from Data Provider
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [tyson@LDAP]
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x22ee740

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x22ee580

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x22ee580 "ltdb_timeout"

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x22ee740 "ltdb_callback"

(Mon Mar 14 14:30:13 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (6): Returning info for user [tyson@LDAP]
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [client_recv] (5): Client disconnected!

The only apparent difference between these is the last line. Running tcpdump and/or stracing the process verifies that both machines are still talking to the LDAP server. It is over SSL though, so I can't tell what they are saying.

Before, when this has happened, it always seemed as if the services have stopped working for all users, but this time I'm noticing a variety of reponses. For example

$ getent passwd gary
gary:*:130000:130000:Gary Molenkamp:/home/gary:/bin/bash
$ getent passwd bge
$

While on the second machine

$ getent passwd gary
gary:*:130000:130000:Gary Molenkamp:/home/gary:/bin/bash
$ getent passwd bge
bge:*:130127:130127:Baolai Ge:/home/bge:/bin/bash
$

Interestingly enough, the logs shows a different reasons for user bge failing than user tyson

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [accept_fd_handler] (4): Client connected!
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [sss_cmd_get_version] (5): Received client version [1].
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [sss_cmd_get_version] (5): Offered version [1].
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [nss_cmd_getpwnam] (4): Requesting info for [bge] from [<ALL>]
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [bge@LDAP]
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x1101f00

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x11021e0

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x11021e0 "ltdb_timeout"

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x1101f00 "ltdb_callback"

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [sss_dp_send_acct_req_create] (4): Sending request for [LDAP][4097][1][name=bge]
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [sbus_add_timeout] (8): 0x10e1f40
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [sbus_remove_timeout] (8): 0x10e1f40
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 10C15C0
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching.
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [sss_dp_get_reply] (4): Got reply (0, 0, Success) from Data Provider
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [bge@LDAP]
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x11021e0

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x11222d0

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x11222d0 "ltdb_timeout"

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x11021e0 "ltdb_callback"

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (6): Returning info for user [bge@LDAP]
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [fill_pwent] (2): Incomplete or fake user object for bge[0]! Skipping
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [client_recv] (5): Client disconnected!

The logs on the second machine show everything is fine

(Mon Mar 14 17:47:47 2011) [sssd[nss]] [accept_fd_handler] (4): Client connected!
(Mon Mar 14 17:47:47 2011) [sssd[nss]] [sss_cmd_get_version] (5): Received client version [1].
(Mon Mar 14 17:47:47 2011) [sssd[nss]] [sss_cmd_get_version] (5): Offered version [1].
(Mon Mar 14 17:47:47 2011) [sssd[nss]] [nss_cmd_getpwnam] (4): Requesting info for [bge] from [<ALL>]
(Mon Mar 14 17:47:47 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [bge@LDAP]
(Mon Mar 14 17:47:47 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x22f58b0

(Mon Mar 14 17:47:47 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x22e8e80

(Mon Mar 14 17:47:47 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x22e8e80 "ltdb_timeout"

(Mon Mar 14 17:47:47 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x22f58b0 "ltdb_callback"

(Mon Mar 14 17:47:47 2011) [sssd[nss]] [sss_dp_send_acct_req_create] (4): Sending request for [LDAP][4097][1][name=bge]
(Mon Mar 14 17:47:47 2011) [sssd[nss]] [sbus_add_timeout] (8): 0x22f5910
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 22C1C10
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching.
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_message_handler] (9): Received SBUS method [ping]
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_remove_timeout] (8): 0x22f5910
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 22C3F40
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching.
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [sss_dp_get_reply] (4): Got reply (0, 0, Success) from Data Provider
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [bge@LDAP]
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x22e8e80

(Mon Mar 14 17:47:48 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x22ed070

(Mon Mar 14 17:47:48 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x22ed070 "ltdb_timeout"

(Mon Mar 14 17:47:48 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x22e8e80 "ltdb_callback"

(Mon Mar 14 17:47:48 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (6): Returning info for user [bge@LDAP]
(Mon Mar 14 17:47:48 2011) [sssd[nss]] [client_recv] (5): Client disconnected!

The non-commented out ('#' or ';') lines in our (identical) configuration files are

[sssd]
config_file_version = 2
reconnection_retries = 3
sbus_timeout = 30
services = nss, pam
domains = LDAP

[nss]
debug_level = 9
filter_groups = root
filter_users = root
reconnection_retries = 3

[pam]
debug_level = 9
reconnection_retries = 3

[domain/LDAP]
id_provider = ldap
auth_provider = ldap
ldap_search_base = dc=sharcnet,dc=ca
ldap_uri = ldaps://ns1.sharcnet.ca
ldap_tls_cipher_suite = HIGH
ldap_tls_cacert = /etc/pki/tls/certs/viz-ldap-cacert.pem
ldap_tls_cert = /etc/pki/tls/private/viz-ldap-cert.pem
ldap_tls_key = /etc/pki/tls/private/viz-ldap-key.pem

Thanks! -Tyson

PS: I googled the log messages a bit, and the nss_cmd_getpwnam_search code came up. It seems quite strange that the line "[nss_cmd_getpwnam_search] (4): Requesting info for [XXX@LDAP]" is logged twice as it looks like it should only happen once.

jhrozek commented 13 years ago

Can you also add debug_level = 9 to [domain/LDAP] and then attach /var/log/sss/sssd_LDAP.log ?

This seems strange:

(Mon Mar 14 17:46:10 2011) [sssd[nss]] [fill_pwent] (2): Incomplete or fake user object for bge[0]! Skipping

This means that user bge has uid set to 0 - we do this to create fake user members as optimization. However, they should be refreshed as they are marked as invalid in our cache to force an LDAP lookup.

If you're able to reproduce this again, it would be nice to see what is the output of:

ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb name=bge

ldbsearch is part of ldb-tools package.

(Substitute bge for username that you are able to reproduce the issue with)

jhrozek commented 13 years ago

Fields changed

cc: => jhrozek

twhitehead commented 13 years ago

I hadn't realized you could specify debug_level for domains. I'll crank it up. I had another two more machines fail today too.

Currently I've downloaded the debug packages, attached gdb, and spent the day stepping through functions on the failed and non-failed machines to see what the difference is. So far I've only managed to figure out how things work in the sssd_nss code. Everything looks fine here.

- upon "gentent ..." running, nss_getpwnam gets called and requests the provider updates the cache over the embedded dbus,
- a successful result is eventually returned, just the subsequent call to ldb_search in the callback returns limited or no results.

Specifically, in the case of tyson, this is nothing. In the case of bge, it is just dataExpireTimestamp, lastUpdate, name, and objectClass. This ldbsearch command on the failed machine agrees with this (it shows that only these 4 of the 9 attributes requests in the code are in the directory)

$ ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb '(name=tyson)'
asq: Unable to register control with rootdse!
# returned 0 records
# 0 entries
# 0 referrals


$ ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb '(name=bge)'
asq: Unable to register control with rootdse!
# record 1
dn: name=bge,cn=users,cn=LDAP,cn=sysdb
createTimestamp: 1296077806
dataExpireTimestamp: 1296077805
lastLogin: 1296077806
lastUpdate: 1296077806
name: bge
objectClass: user
memberof: name=SN_staff,cn=groups,cn=LDAP,cn=sysdb
memberof: name=web_admin,cn=groups,cn=LDAP,cn=sysdb
memberof: name=fairshare-1,cn=groups,cn=LDAP,cn=sysdb
distinguishedName: name=bge,cn=users,cn=LDAP,cn=sysdb

# returned 1 records
# 1 entries
# 0 referrals

Non-failed machines return full records for both tyson and bge. For example,

$ ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb '(name=tyson)'
asq: Unable to register control with rootdse!
# record 1
dn: name=tyson,cn=users,cn=LDAP,cn=sysdb
createTimestamp: 1300312746
fullName: Tyson Whitehead
gecos: Tyson Whitehead
gidNumber: 130018
homeDirectory: /home/tyson
loginShell: /bin/bash
name: tyson
objectClass: user
uidNumber: 130018
originalDN: uid=tyson,ou=People,dc=sharcnet,dc=ca
originalModifyTimestamp: 20110301190333Z
shadowLastChange: 13080
shadowMax: 99999
lastUpdate: 1300320922
dataExpireTimestamp: 1300326322
distinguishedName: name=tyson,cn=users,cn=LDAP,cn=sysdb

# returned 1 records
# 1 entries
# 0 referrals

Interestingly enough, if I try to dump everything with -a on the failed machine, it just hangs. This got me looking at locks, and I'm wondering if the underlying problem could be a deadlock blocking the cache update in the sssd_be process.

$ ps -A -o pid,args | egrep sssd_be
14653 /usr/libexec/sssd/sssd_be -d 0 --debug-to-files --domain LDAP


$ stat /var/lib/sss/db/cache_LDAP.ldb
  File: `/var/lib/sss/db/cache_LDAP.ldb'
  Size: 33751040        Blocks: 65920      IO Block: 4096   regular file
Device: fd01h/64769d    Inode: 393557      Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2011-03-16 20:06:51.742764543 -0400
Modify: 2011-03-10 16:35:13.705226652 -0500
Change: 2011-03-10 16:35:13.705226652 -0500


$ cat /proc/locks 
1: POSIX  ADVISORY  READ  14653 fd:01:393557 168 EOF
2: POSIX  ADVISORY  WRITE 14653 fd:01:393557 8 8
3: POSIX  ADVISORY  WRITE 2028 fd:01:395461 0 EOF
4: POSIX  ADVISORY  WRITE 2026 fd:01:395392 0 EOF
5: POSIX  ADVISORY  WRITE 1835 fd:01:395760 0 EOF
6: POSIX  ADVISORY  WRITE 1802 fd:01:395474 0 EOF
7: FLOCK  ADVISORY  WRITE 1452 fd:01:393303 0 EOF

You can see sssd_be (pid 14653) has an outstanding READ and a WRITE lock on "/var/lib/sss/db/cache_LDAP.ldb" (inode 353557) on the failed machine (and they haven't remain the same every time I've checked since I've noticed them -- around 30min so far). The non-failed machines show no outstanding locks by sssd_be.

If you have anything specific you want me to have a look at with the gdb on the failed machine (e.g., perhaps there is some data structures I could print that would reveal more information about these locks), just let me know.

Thanks! -Tyson

twhitehead commented 13 years ago

There was a bit of a garbled sentence in there. What I meant regarding the locks was that I've checked several times since noticing them and there has been no change.

jhrozek commented 13 years ago

Since the locks are in place, I suspect we are not finishing or cancelling a sysdb transaction correctly.

jhrozek commented 13 years ago

btw I think it would still be nice to see logs from the LDAP domain, the LDB transaction start/stop routines log, too, so we might get an idea where it went wrong..

twhitehead commented 13 years ago

I managed to turn up the debug level on the failed machine and captured the attached log entry (a bit long to include inline) when running

$ getent passwd tyson

I also verified that the LDAP server was indeed still returning the appropriate entries by examining the result messages in the sdap_process_message routine with gdb. However, as verified by

$ ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb '(name=tyson)'

it just never seems to make it into the cache.

I've now restarted sssd with full debugging for the LDAP domain on 11 of our machines, so I'll hopefully have a full log (leading up to and after a failure) to attach soon.

I also happened to notice one of the old gzipped sssd_LDAP.log files was non-zero on the affected machine. It's content was

(Thu Mar 10 16:35:13 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction
(Thu Mar 10 16:35:14 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction
(Thu Mar 10 16:35:18 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction
(Thu Mar 10 16:35:22 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction
(Thu Mar 10 16:35:25 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction
(Thu Mar 10 16:35:29 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction
(Thu Mar 10 16:35:33 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction
(Thu Mar 10 16:35:37 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction

Likewise, on another machine I noticed had failed today (also contained active locks), I found a non-zero gzipped log file whose contents was

(Thu Mar 10 19:50:05 2011) [sssd[be[LDAP]]] [ldb] (0): A transaction is still active in ldb context [0x172b6d0] on /var/lib/sss/db/cache_LDAP.ldb

Cheers! -Tyson

twhitehead commented 13 years ago

sssd_LDAP.log file contents for "getent passwd tyson" on failed machine after turning up debug level via gdb
sssd_LDAP-tyson.log

twhitehead commented 13 years ago

Fields changed

cc: jhrozek => jhrozek, twhitehead

twhitehead commented 13 years ago

sssd_LDAP.log file contents for "getent passwd tyson" on failed machine after restarting sssd ("getent passwd tyson" returned correct results))
sssd_LDAP-tyson2.log

jhrozek commented 13 years ago

I'm sorry but I can't see any failure in the logs you attached. The only difference is that the file sssd_LDAP-tyson2.log shows storing a new entry into the database - as you can see from the sysdb_search_user_by_name() and sysdb_search_user_by_uid() calls returning 2, which means the entry was not cached yet.

The log file sssd_LDAP-tyson.log exhibits updating an existing user, possibly after a cache expire. Both logs report success in the end ([acctinfo_callback] (4): Request processed. Returned 0,0,Success)

Have you encountered the issue again? Are there any more messages mentioning "Could not cancel sysdb transaction" like the ones in comment 7 perhaps?

sgallagh commented 13 years ago

Fields changed

owner: somebody => jhrozek

twhitehead commented 13 years ago

We had another failure. This time I had full logging.

I'm putting them on our www server as they are a bit big (about 24MB total) and this way I can remove them later. I've mapped all our user and groups names to annon[to minimize the amount of internal information in the logs. The links are: [http://www.sharcnet.ca/~tyson/sssd.log sssd.log] (suppose to be empty), [http://www.sharcnet.ca/~tyson/sssd_nss.log sssd_nss.log], [http://www.sharcnet.ca/~tyson/sssd_pam.log sssd_pam.log], and http://www.sharcnet.ca/~tyson/sssd_LDAP.log sssd_LDAP.log.

I've grepped through them a bit, and it looks like the last nesting level 0 transaction occured on Wed Mar 23 13:48:48 2011.

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sbus_dispatch] (9): dbus conn: FEB310
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sbus_dispatch] (9): Dispatching.
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sbus_message_handler] (9): Received SBUS method [getAccountInfo]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [be_get_account_info] (4): Got request for [4098][1][idnumber=3000]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_id_op_connect_step] (9): reusing cached connection
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (6): calling ldap_search_ext with [(&(gidNumber=3000)(objectclass=posixGroup))][dc=annon241,dc=ca].
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [objectClass]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [cn]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [userPassword]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [gidNumber]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [memberuid]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (8): ldap_search_ext called, msgid = 212
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0xfda020], connected[1], ops[0x107b190], ldap[0x10a4ea0]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_parse_entry] (9): OriginalDN: [cn=annon4088,ou=Group,dc=annon241,dc=ca].
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0xfda020], connected[1], ops[0x107b190], ldap[0x10a4ea0]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_done] (6): Search result: Success(0), (null)
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_groups_process] (6): Search for groups, returned 1 results.
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x10af040

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x10d2440

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x10d2440 "ltdb_timeout"

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x10af040 "ltdb_callback"

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sysdb_search_user_by_name] (6): Error: 2 (No such file or directory)
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): member #0 (annon1): not found in sysdb
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_missing_member_2307] (7): Adding a dummy entry
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 0)
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 1)
...

This carries on for quite awhile, and then, around where the transaction should have ended, there is instead a log message about a memory allocation failure.

...
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): Member already cached in sysdb: name=annon2725,cn=users,cn=LDAP,cn=sysdb
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x117b8f0

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x116f7a0

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x116f7a0 "ltdb_timeout"

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x117b8f0 "ltdb_callback"

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): Member already cached in sysdb: name=annon2726,cn=users,cn=LDAP,cn=sysdb
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sysdb_search_user_by_name] (6): Error: 12 (Cannot allocate memory)
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (1): Error checking cache for member #150 ():
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_id_op_done] (9): releasing operation connection
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [acctinfo_callback] (4): Request processed. Returned 3,12,Out of memory
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0xfda020], connected[1], ops[(nil)], ldap[0x10a4ea0]
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: ldap_result found nothing!

This is quite strange as all these machines have either 8, 16, or 50GB of RAM, and, while I don't know what it was at that time, most of it is currently free and the kernel logs show no OOM activity.

Anyway, looks like the transaction is not properly released along the failure path.

Cheers! -Tyson

sgallagh commented 13 years ago

The ENOMEM is a red herring. It's set because when the function was originally created, the only failure case was "out of memory". There's now another failure case (which you're hitting) where we can't construct the search filter because the name value is unparseable.

Looking at the next line,

(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (1): Error checking cache for member #150 ():

That trailing () is supposed to contain the name of the member. We're receiving a zero-length string for the member name, which the above parser is refusing.

Can you show us the complete LDAP entry for the group with GID 3000? I suspect you may have an empty member attribute.

twhitehead commented 13 years ago

This is definitely it. I just had another machine go off line and it's logs are showing the exact same thing. I also discovered I can immediately trigger it now on any of the machines by just doing

$ getent group annon4088

I've put the dump the results of doing

$ ldapsearch -x (gidNumber=3000)

on our www server too [group3000.txt], where I've mapped all user and group names to annonnumber as before and replaced the password with pass0.

I think I see what you are talking about. Approximately half way through there is a blank memberUid field.

# extended LDIF
#
# LDAPv3
# base <dc=annon241,dc=ca> (default) with scope subtree
# filter: (gidNumber=3000)
# requesting: ALL
#

# annon4088, Group, annon241.ca
dn: cn=annon4088,ou=Group,dc=annon241,dc=ca
objectClass: posixGroup
objectClass: top
cn: annon4088
userPassword:: pass0
gidNumber: 3000
memberUid: annon1
memberUid: annon82
...
memberUid: annon2725
memberUid: annon2726
memberUid:
memberUid: annon2756
memberUid: annon2784
...
memberUid: annon3526
memberUid: annon3517

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

I'll follow up with our LDAP administrator regarding the empty entry.

I can certainly test any patch you may have to make sure the transaction gets released in the event of malformed entries like these first if you want.

Thanks very much! -Tyson

sgallagh commented 13 years ago

Fields changed

owner: jhrozek => sgallagh
status: new => assigned

sgallagh commented 13 years ago

Fields changed

component: SSSD => LDAP Provider
milestone: NEEDS_TRIAGE => SSSD 1.5.5
priority: major => blocker

sgallagh commented 13 years ago

So there were essentially two interrelated bugs here.

1) We were throwing an error failing to parse a memberuid (zero-length). We should be ignoring them instead.
2) When we got an error, we were not cancelling the in-progress transaction. This fouled up all future cache activities.

I have posted patches to fix both issues to the sssd-devel list:
https://fedorahosted.org/pipermail/sssd-devel/2011-March/005902.html

twhitehead commented 13 years ago

Those seem to have done it. First I just applied just the patch for condition (2) (not cancelling the in-progress transaction) to make sure it works. It did, as verified by

$ getent group annon4088
$

the log files showing that the transaction was being backed out as it should

...
(Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (1): Error checking cache for member #150
 ():
(Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [ldb] (9): cancel ldb transaction (nesting: 1)
(Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [ldb] (9): cancel ldb transaction (nesting: 0)
(Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [sdap_id_op_done] (9): releasing operation connection
(Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [acctinfo_callback] (4): Request processed. Returned 3,12,Out of memory

and further requests working as they should

$ getent passwd tyson
tyson:*:130018:130018:Tyson Whitehead:/home/tyson:/bin/bash
$

I then applied patch for condition (1) (ignoring zero-length memberuid fields) as well. It also worked, as verified by

$ getent group annon4088
... a lot of output ...
$

and the log files showed that the empty memberUid entry 150 was just skipped over

...
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): member #149 (annon2726): not found in sysdb
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sdap_process_missing_member_2307] (7): Adding a dummy entry
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 2)
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x27d3b10

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x27d3c30

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x27d3c30 "ltdb_timeout"

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x27d3b10 "ltdb_callback"

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): commit ldb transaction (nesting: 2)
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x27b19d0

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x27d2c00

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x27d2c00 "ltdb_timeout"

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x27b19d0 "ltdb_callback"

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sysdb_search_user_by_name] (6): Error: 2 (No such file or directory)
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): member #151 (annon2756): not found in sysdb
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sdap_process_missing_member_2307] (7): Adding a dummy entry
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 2)
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x27d3e90

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x27d3070

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x27d3070 "ltdb_timeout"

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x27d3e90 "ltdb_callback"

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): commit ldb transaction (nesting: 2)
(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x27d3c00

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x27d3a70

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x27d3a70 "ltdb_timeout"

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x27d3c00 "ltdb_callback"

(Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sysdb_search_user_by_name] (6): Error: 2 (No such file or directory)
...

I'm please to say it looks like this bug report can be closed. Thanks very much for all the hard work.

Cheers! -Tyson

sgallagh commented 13 years ago

Fixed in:

master

- 97bd239c546febf20008e4e2d8a959b91c738d1f
- 95c3131ffeef889399f5afada8b8e47e7892d81f
- 1a881a1b5997494ef3228cf35ae0f2c7381d14ff

sssd-1-5

- 6d4430c2a6ae34012dabb95f82b32f7e8068700b
- 50968b666d1b2ca6ee4825b590bfd84ab1e32e55
- 2e59b72c0d8de3183996063c5d39be6b66c446f5

resolution: => fixed
status: assigned => closed
tests: 0 => 1

dpal commented 12 years ago

Fields changed

rhbz: => 0

Metadata Update from @twhitehead:
- Issue assigned to sgallagh
- Issue set to the milestone: SSSD 1.5.5

7 years ago

pbrezina commented 3 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/1869

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata

Assignee

sgallagh

Tags

None

Blocking

None

Depending on

None

Priority

blocker

Milestone

SSSD 1.5.5

type

defect

component

LDAP Provider

version

1.5.1

selected

None

testsupdated

patch

None

rhbz

design_review

None

review

None

changelog

None

keywords

None

coverity

None

mark

None

blocking

None

design

None

sensitive

None

jhrozek, twhitehead

blockedby

None

feature_milestone

None

SSSD / sssd

Source Code

Documentation

#827 Something breaks mysteriously after a couple of weeks of running Closed: Fixed None Opened 13 years ago by twhitehead.

master

sssd-1-5

Metadata

#827 Something breaks mysteriously after a couple of weeks of running

Closed: Fixed None Opened 13 years ago by twhitehead.