Learn more about these different git repos.
Other Git URLs
We have an LDAP based setup using sssd for nss and pam on Fedora 14. After an apparently random number of days, sssd just stops returning entries until it is restarted.
For example, currently on one of our machines, I get
$ getent passwd tyson $
while on another identically configured machine, I get
$ getent passwd tyson tyson:*:130018:130018:Tyson Whitehead:/home/tyson:/bin/bash $
This is the third our fourth time in last two or three months that this has happened, so I've turned the verbosity all the way up in logs. The first machines's sssd_nss.log says
(Mon Mar 14 14:26:25 2011) [sssd[nss]] [accept_fd_handler] (4): Client connected! (Mon Mar 14 14:26:25 2011) [sssd[nss]] [sss_cmd_get_version] (5): Received client version [1]. (Mon Mar 14 14:26:25 2011) [sssd[nss]] [sss_cmd_get_version] (5): Offered version [1]. (Mon Mar 14 14:26:25 2011) [sssd[nss]] [nss_cmd_getpwnam] (4): Requesting info for [tyson] from [<ALL>] (Mon Mar 14 14:26:25 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [tyson@LDAP] (Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x110b2a0 (Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x110b390 (Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x110b390 "ltdb_timeout" (Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x110b2a0 "ltdb_callback" (Mon Mar 14 14:26:25 2011) [sssd[nss]] [sss_dp_send_acct_req_create] (4): Sending request for [LDAP][4097][1][name=tyson] (Mon Mar 14 14:26:25 2011) [sssd[nss]] [sbus_add_timeout] (8): 0x10daf40 (Mon Mar 14 14:26:25 2011) [sssd[nss]] [sbus_remove_timeout] (8): 0x10daf40 (Mon Mar 14 14:26:25 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 10C15C0 (Mon Mar 14 14:26:25 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching. (Mon Mar 14 14:26:25 2011) [sssd[nss]] [sss_dp_get_reply] (4): Got reply (0, 0, Success) from Data Provider (Mon Mar 14 14:26:25 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [tyson@LDAP] (Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x110b7f0 (Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x110b8a0 (Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x110b8a0 "ltdb_timeout" (Mon Mar 14 14:26:25 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x110b7f0 "ltdb_callback" (Mon Mar 14 14:26:25 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (2): No matching domain found for [tyson], fail! (Mon Mar 14 14:26:26 2011) [sssd[nss]] [client_recv] (5): Client disconnected!
The second machines sssd_nss.log says
(Mon Mar 14 14:30:13 2011) [sssd[nss]] [accept_fd_handler] (4): Client connected! (Mon Mar 14 14:30:13 2011) [sssd[nss]] [sss_cmd_get_version] (5): Received client version [1]. (Mon Mar 14 14:30:13 2011) [sssd[nss]] [sss_cmd_get_version] (5): Offered version [1]. (Mon Mar 14 14:30:13 2011) [sssd[nss]] [nss_cmd_getpwnam] (4): Requesting info for [tyson] from [<ALL>] (Mon Mar 14 14:30:13 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [tyson@LDAP] (Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x22ee620 (Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x22ee740 (Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x22ee740 "ltdb_timeout" (Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x22ee620 "ltdb_callback" (Mon Mar 14 14:30:13 2011) [sssd[nss]] [sss_dp_send_acct_req_create] (4): Sending request for [LDAP][4097][1][name=tyson] (Mon Mar 14 14:30:13 2011) [sssd[nss]] [sbus_add_timeout] (8): 0x22eda40 (Mon Mar 14 14:30:13 2011) [sssd[nss]] [sbus_remove_timeout] (8): 0x22eda40 (Mon Mar 14 14:30:13 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 22C3F40 (Mon Mar 14 14:30:13 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching. (Mon Mar 14 14:30:13 2011) [sssd[nss]] [sss_dp_get_reply] (4): Got reply (0, 0, Success) from Data Provider (Mon Mar 14 14:30:13 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [tyson@LDAP] (Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x22ee740 (Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x22ee580 (Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x22ee580 "ltdb_timeout" (Mon Mar 14 14:30:13 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x22ee740 "ltdb_callback" (Mon Mar 14 14:30:13 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (6): Returning info for user [tyson@LDAP] (Mon Mar 14 14:30:13 2011) [sssd[nss]] [client_recv] (5): Client disconnected!
The only apparent difference between these is the last line. Running tcpdump and/or stracing the process verifies that both machines are still talking to the LDAP server. It is over SSL though, so I can't tell what they are saying.
Before, when this has happened, it always seemed as if the services have stopped working for all users, but this time I'm noticing a variety of reponses. For example
$ getent passwd gary gary:*:130000:130000:Gary Molenkamp:/home/gary:/bin/bash $ getent passwd bge $
While on the second machine
$ getent passwd gary gary:*:130000:130000:Gary Molenkamp:/home/gary:/bin/bash $ getent passwd bge bge:*:130127:130127:Baolai Ge:/home/bge:/bin/bash $
Interestingly enough, the logs shows a different reasons for user bge failing than user tyson
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [accept_fd_handler] (4): Client connected! (Mon Mar 14 17:46:10 2011) [sssd[nss]] [sss_cmd_get_version] (5): Received client version [1]. (Mon Mar 14 17:46:10 2011) [sssd[nss]] [sss_cmd_get_version] (5): Offered version [1]. (Mon Mar 14 17:46:10 2011) [sssd[nss]] [nss_cmd_getpwnam] (4): Requesting info for [bge] from [<ALL>] (Mon Mar 14 17:46:10 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [bge@LDAP] (Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x1101f00 (Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x11021e0 (Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x11021e0 "ltdb_timeout" (Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x1101f00 "ltdb_callback" (Mon Mar 14 17:46:10 2011) [sssd[nss]] [sss_dp_send_acct_req_create] (4): Sending request for [LDAP][4097][1][name=bge] (Mon Mar 14 17:46:10 2011) [sssd[nss]] [sbus_add_timeout] (8): 0x10e1f40 (Mon Mar 14 17:46:10 2011) [sssd[nss]] [sbus_remove_timeout] (8): 0x10e1f40 (Mon Mar 14 17:46:10 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 10C15C0 (Mon Mar 14 17:46:10 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching. (Mon Mar 14 17:46:10 2011) [sssd[nss]] [sss_dp_get_reply] (4): Got reply (0, 0, Success) from Data Provider (Mon Mar 14 17:46:10 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [bge@LDAP] (Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x11021e0 (Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x11222d0 (Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x11222d0 "ltdb_timeout" (Mon Mar 14 17:46:10 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x11021e0 "ltdb_callback" (Mon Mar 14 17:46:10 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (6): Returning info for user [bge@LDAP] (Mon Mar 14 17:46:10 2011) [sssd[nss]] [fill_pwent] (2): Incomplete or fake user object for bge[0]! Skipping (Mon Mar 14 17:46:10 2011) [sssd[nss]] [client_recv] (5): Client disconnected!
The logs on the second machine show everything is fine
(Mon Mar 14 17:47:47 2011) [sssd[nss]] [accept_fd_handler] (4): Client connected! (Mon Mar 14 17:47:47 2011) [sssd[nss]] [sss_cmd_get_version] (5): Received client version [1]. (Mon Mar 14 17:47:47 2011) [sssd[nss]] [sss_cmd_get_version] (5): Offered version [1]. (Mon Mar 14 17:47:47 2011) [sssd[nss]] [nss_cmd_getpwnam] (4): Requesting info for [bge] from [<ALL>] (Mon Mar 14 17:47:47 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [bge@LDAP] (Mon Mar 14 17:47:47 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x22f58b0 (Mon Mar 14 17:47:47 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x22e8e80 (Mon Mar 14 17:47:47 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x22e8e80 "ltdb_timeout" (Mon Mar 14 17:47:47 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x22f58b0 "ltdb_callback" (Mon Mar 14 17:47:47 2011) [sssd[nss]] [sss_dp_send_acct_req_create] (4): Sending request for [LDAP][4097][1][name=bge] (Mon Mar 14 17:47:47 2011) [sssd[nss]] [sbus_add_timeout] (8): 0x22f5910 (Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 22C1C10 (Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching. (Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_message_handler] (9): Received SBUS method [ping] (Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_remove_timeout] (8): 0x22f5910 (Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_dispatch] (9): dbus conn: 22C3F40 (Mon Mar 14 17:47:48 2011) [sssd[nss]] [sbus_dispatch] (9): Dispatching. (Mon Mar 14 17:47:48 2011) [sssd[nss]] [sss_dp_get_reply] (4): Got reply (0, 0, Success) from Data Provider (Mon Mar 14 17:47:48 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (4): Requesting info for [bge@LDAP] (Mon Mar 14 17:47:48 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x22e8e80 (Mon Mar 14 17:47:48 2011) [sssd[nss]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x22ed070 (Mon Mar 14 17:47:48 2011) [sssd[nss]] [ldb] (9): tevent: Destroying timer event 0x22ed070 "ltdb_timeout" (Mon Mar 14 17:47:48 2011) [sssd[nss]] [ldb] (9): tevent: Ending timer event 0x22e8e80 "ltdb_callback" (Mon Mar 14 17:47:48 2011) [sssd[nss]] [nss_cmd_getpwnam_search] (6): Returning info for user [bge@LDAP] (Mon Mar 14 17:47:48 2011) [sssd[nss]] [client_recv] (5): Client disconnected!
The non-commented out ('#' or ';') lines in our (identical) configuration files are
[sssd] config_file_version = 2 reconnection_retries = 3 sbus_timeout = 30 services = nss, pam domains = LDAP [nss] debug_level = 9 filter_groups = root filter_users = root reconnection_retries = 3 [pam] debug_level = 9 reconnection_retries = 3 [domain/LDAP] id_provider = ldap auth_provider = ldap ldap_search_base = dc=sharcnet,dc=ca ldap_uri = ldaps://ns1.sharcnet.ca ldap_tls_cipher_suite = HIGH ldap_tls_cacert = /etc/pki/tls/certs/viz-ldap-cacert.pem ldap_tls_cert = /etc/pki/tls/private/viz-ldap-cert.pem ldap_tls_key = /etc/pki/tls/private/viz-ldap-key.pem
Thanks! -Tyson
PS: I googled the log messages a bit, and the nss_cmd_getpwnam_search code came up. It seems quite strange that the line "[nss_cmd_getpwnam_search] (4): Requesting info for [XXX@LDAP]" is logged twice as it looks like it should only happen once.
Can you also add debug_level = 9 to [domain/LDAP] and then attach /var/log/sss/sssd_LDAP.log ?
debug_level = 9
[domain/LDAP]
/var/log/sss/sssd_LDAP.log
This seems strange:
(Mon Mar 14 17:46:10 2011) [sssd[nss]] [fill_pwent] (2): Incomplete or fake user object for bge[0]! Skipping
This means that user bge has uid set to 0 - we do this to create fake user members as optimization. However, they should be refreshed as they are marked as invalid in our cache to force an LDAP lookup.
If you're able to reproduce this again, it would be nice to see what is the output of:
ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb name=bge
ldbsearch is part of ldb-tools package.
(Substitute bge for username that you are able to reproduce the issue with)
Fields changed
cc: => jhrozek
I hadn't realized you could specify debug_level for domains. I'll crank it up. I had another two more machines fail today too.
Currently I've downloaded the debug packages, attached gdb, and spent the day stepping through functions on the failed and non-failed machines to see what the difference is. So far I've only managed to figure out how things work in the sssd_nss code. Everything looks fine here.
- upon "gentent ..." running, nss_getpwnam gets called and requests the provider updates the cache over the embedded dbus, - a successful result is eventually returned, just the subsequent call to ldb_search in the callback returns limited or no results.
Specifically, in the case of tyson, this is nothing. In the case of bge, it is just dataExpireTimestamp, lastUpdate, name, and objectClass. This ldbsearch command on the failed machine agrees with this (it shows that only these 4 of the 9 attributes requests in the code are in the directory)
$ ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb '(name=tyson)' asq: Unable to register control with rootdse! # returned 0 records # 0 entries # 0 referrals $ ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb '(name=bge)' asq: Unable to register control with rootdse! # record 1 dn: name=bge,cn=users,cn=LDAP,cn=sysdb createTimestamp: 1296077806 dataExpireTimestamp: 1296077805 lastLogin: 1296077806 lastUpdate: 1296077806 name: bge objectClass: user memberof: name=SN_staff,cn=groups,cn=LDAP,cn=sysdb memberof: name=web_admin,cn=groups,cn=LDAP,cn=sysdb memberof: name=fairshare-1,cn=groups,cn=LDAP,cn=sysdb distinguishedName: name=bge,cn=users,cn=LDAP,cn=sysdb # returned 1 records # 1 entries # 0 referrals
Non-failed machines return full records for both tyson and bge. For example,
$ ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb '(name=tyson)' asq: Unable to register control with rootdse! # record 1 dn: name=tyson,cn=users,cn=LDAP,cn=sysdb createTimestamp: 1300312746 fullName: Tyson Whitehead gecos: Tyson Whitehead gidNumber: 130018 homeDirectory: /home/tyson loginShell: /bin/bash name: tyson objectClass: user uidNumber: 130018 originalDN: uid=tyson,ou=People,dc=sharcnet,dc=ca originalModifyTimestamp: 20110301190333Z shadowLastChange: 13080 shadowMax: 99999 lastUpdate: 1300320922 dataExpireTimestamp: 1300326322 distinguishedName: name=tyson,cn=users,cn=LDAP,cn=sysdb # returned 1 records # 1 entries # 0 referrals
Interestingly enough, if I try to dump everything with -a on the failed machine, it just hangs. This got me looking at locks, and I'm wondering if the underlying problem could be a deadlock blocking the cache update in the sssd_be process.
$ ps -A -o pid,args | egrep sssd_be 14653 /usr/libexec/sssd/sssd_be -d 0 --debug-to-files --domain LDAP $ stat /var/lib/sss/db/cache_LDAP.ldb File: `/var/lib/sss/db/cache_LDAP.ldb' Size: 33751040 Blocks: 65920 IO Block: 4096 regular file Device: fd01h/64769d Inode: 393557 Links: 1 Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2011-03-16 20:06:51.742764543 -0400 Modify: 2011-03-10 16:35:13.705226652 -0500 Change: 2011-03-10 16:35:13.705226652 -0500 $ cat /proc/locks 1: POSIX ADVISORY READ 14653 fd:01:393557 168 EOF 2: POSIX ADVISORY WRITE 14653 fd:01:393557 8 8 3: POSIX ADVISORY WRITE 2028 fd:01:395461 0 EOF 4: POSIX ADVISORY WRITE 2026 fd:01:395392 0 EOF 5: POSIX ADVISORY WRITE 1835 fd:01:395760 0 EOF 6: POSIX ADVISORY WRITE 1802 fd:01:395474 0 EOF 7: FLOCK ADVISORY WRITE 1452 fd:01:393303 0 EOF
You can see sssd_be (pid 14653) has an outstanding READ and a WRITE lock on "/var/lib/sss/db/cache_LDAP.ldb" (inode 353557) on the failed machine (and they haven't remain the same every time I've checked since I've noticed them -- around 30min so far). The non-failed machines show no outstanding locks by sssd_be.
If you have anything specific you want me to have a look at with the gdb on the failed machine (e.g., perhaps there is some data structures I could print that would reveal more information about these locks), just let me know.
There was a bit of a garbled sentence in there. What I meant regarding the locks was that I've checked several times since noticing them and there has been no change.
Since the locks are in place, I suspect we are not finishing or cancelling a sysdb transaction correctly.
btw I think it would still be nice to see logs from the LDAP domain, the LDB transaction start/stop routines log, too, so we might get an idea where it went wrong..
I managed to turn up the debug level on the failed machine and captured the attached log entry (a bit long to include inline) when running
$ getent passwd tyson
I also verified that the LDAP server was indeed still returning the appropriate entries by examining the result messages in the sdap_process_message routine with gdb. However, as verified by
$ ldbsearch -H /var/lib/sss/db/cache_LDAP.ldb '(name=tyson)'
it just never seems to make it into the cache.
I've now restarted sssd with full debugging for the LDAP domain on 11 of our machines, so I'll hopefully have a full log (leading up to and after a failure) to attach soon.
I also happened to notice one of the old gzipped sssd_LDAP.log files was non-zero on the affected machine. It's content was
(Thu Mar 10 16:35:13 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction (Thu Mar 10 16:35:14 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction (Thu Mar 10 16:35:18 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction (Thu Mar 10 16:35:22 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction (Thu Mar 10 16:35:25 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction (Thu Mar 10 16:35:29 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction (Thu Mar 10 16:35:33 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction (Thu Mar 10 16:35:37 2011) [sssd[be[LDAP]]] [sdap_get_groups_done] (0): Could not cancel sysdb transaction
Likewise, on another machine I noticed had failed today (also contained active locks), I found a non-zero gzipped log file whose contents was
(Thu Mar 10 19:50:05 2011) [sssd[be[LDAP]]] [ldb] (0): A transaction is still active in ldb context [0x172b6d0] on /var/lib/sss/db/cache_LDAP.ldb
Cheers! -Tyson
sssd_LDAP.log file contents for "getent passwd tyson" on failed machine after turning up debug level via gdb sssd_LDAP-tyson.log
cc: jhrozek => jhrozek, twhitehead
sssd_LDAP.log file contents for "getent passwd tyson" on failed machine after restarting sssd ("getent passwd tyson" returned correct results)) sssd_LDAP-tyson2.log
I'm sorry but I can't see any failure in the logs you attached. The only difference is that the file sssd_LDAP-tyson2.log shows storing a new entry into the database - as you can see from the sysdb_search_user_by_name() and sysdb_search_user_by_uid() calls returning 2, which means the entry was not cached yet.
sssd_LDAP-tyson2.log
sysdb_search_user_by_name()
sysdb_search_user_by_uid()
The log file sssd_LDAP-tyson.log exhibits updating an existing user, possibly after a cache expire. Both logs report success in the end ([acctinfo_callback] (4): Request processed. Returned 0,0,Success)
sssd_LDAP-tyson.log
[acctinfo_callback] (4): Request processed. Returned 0,0,Success
Have you encountered the issue again? Are there any more messages mentioning "Could not cancel sysdb transaction" like the ones in comment 7 perhaps?
"Could not cancel sysdb transaction"
owner: somebody => jhrozek
We had another failure. This time I had full logging.
I'm putting them on our www server as they are a bit big (about 24MB total) and this way I can remove them later. I've mapped all our user and groups names to annon[to minimize the amount of internal information in the logs. The links are: [http://www.sharcnet.ca/~tyson/sssd.log sssd.log] (suppose to be empty), [http://www.sharcnet.ca/~tyson/sssd_nss.log sssd_nss.log], [http://www.sharcnet.ca/~tyson/sssd_pam.log sssd_pam.log], and http://www.sharcnet.ca/~tyson/sssd_LDAP.log sssd_LDAP.log.
I've grepped through them a bit, and it looks like the last nesting level 0 transaction occured on Wed Mar 23 13:48:48 2011.
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sbus_dispatch] (9): dbus conn: FEB310 (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sbus_dispatch] (9): Dispatching. (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sbus_message_handler] (9): Received SBUS method [getAccountInfo] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [be_get_account_info] (4): Got request for [4098][1][idnumber=3000] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_id_op_connect_step] (9): reusing cached connection (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (6): calling ldap_search_ext with [(&(gidNumber=3000)(objectclass=posixGroup))][dc=annon241,dc=ca]. (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [objectClass] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [cn] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [userPassword] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [gidNumber] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [memberuid] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_send] (8): ldap_search_ext called, msgid = 212 (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0xfda020], connected[1], ops[0x107b190], ldap[0x10a4ea0] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_parse_entry] (9): OriginalDN: [cn=annon4088,ou=Group,dc=annon241,dc=ca]. (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0xfda020], connected[1], ops[0x107b190], ldap[0x10a4ea0] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_generic_done] (6): Search result: Success(0), (null) (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_get_groups_process] (6): Search for groups, returned 1 results. (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x10af040 (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x10d2440 (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x10d2440 "ltdb_timeout" (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x10af040 "ltdb_callback" (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sysdb_search_user_by_name] (6): Error: 2 (No such file or directory) (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): member #0 (annon1): not found in sysdb (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_missing_member_2307] (7): Adding a dummy entry (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 0) (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 1) ...
This carries on for quite awhile, and then, around where the transaction should have ended, there is instead a log message about a memory allocation failure.
... (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): Member already cached in sysdb: name=annon2725,cn=users,cn=LDAP,cn=sysdb (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x117b8f0 (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x116f7a0 (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x116f7a0 "ltdb_timeout" (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x117b8f0 "ltdb_callback" (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): Member already cached in sysdb: name=annon2726,cn=users,cn=LDAP,cn=sysdb (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sysdb_search_user_by_name] (6): Error: 12 (Cannot allocate memory) (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (1): Error checking cache for member #150 (): (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_id_op_done] (9): releasing operation connection (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [acctinfo_callback] (4): Request processed. Returned 3,12,Out of memory (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0xfda020], connected[1], ops[(nil)], ldap[0x10a4ea0] (Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: ldap_result found nothing!
This is quite strange as all these machines have either 8, 16, or 50GB of RAM, and, while I don't know what it was at that time, most of it is currently free and the kernel logs show no OOM activity.
Anyway, looks like the transaction is not properly released along the failure path.
The ENOMEM is a red herring. It's set because when the function was originally created, the only failure case was "out of memory". There's now another failure case (which you're hitting) where we can't construct the search filter because the name value is unparseable.
Looking at the next line,
(Wed Mar 23 13:48:48 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (1): Error checking cache for member #150 ():
That trailing () is supposed to contain the name of the member. We're receiving a zero-length string for the member name, which the above parser is refusing.
Can you show us the complete LDAP entry for the group with GID 3000? I suspect you may have an empty member attribute.
This is definitely it. I just had another machine go off line and it's logs are showing the exact same thing. I also discovered I can immediately trigger it now on any of the machines by just doing
$ getent group annon4088
I've put the dump the results of doing
$ ldapsearch -x (gidNumber=3000)
on our www server too [group3000.txt], where I've mapped all user and group names to annonnumber as before and replaced the password with pass0.
I think I see what you are talking about. Approximately half way through there is a blank memberUid field.
# extended LDIF # # LDAPv3 # base <dc=annon241,dc=ca> (default) with scope subtree # filter: (gidNumber=3000) # requesting: ALL # # annon4088, Group, annon241.ca dn: cn=annon4088,ou=Group,dc=annon241,dc=ca objectClass: posixGroup objectClass: top cn: annon4088 userPassword:: pass0 gidNumber: 3000 memberUid: annon1 memberUid: annon82 ... memberUid: annon2725 memberUid: annon2726 memberUid: memberUid: annon2756 memberUid: annon2784 ... memberUid: annon3526 memberUid: annon3517 # search result search: 2 result: 0 Success # numResponses: 2 # numEntries: 1
I'll follow up with our LDAP administrator regarding the empty entry.
I can certainly test any patch you may have to make sure the transaction gets released in the event of malformed entries like these first if you want.
Thanks very much! -Tyson
owner: jhrozek => sgallagh status: new => assigned
component: SSSD => LDAP Provider milestone: NEEDS_TRIAGE => SSSD 1.5.5 priority: major => blocker
So there were essentially two interrelated bugs here.
1) We were throwing an error failing to parse a memberuid (zero-length). We should be ignoring them instead. 2) When we got an error, we were not cancelling the in-progress transaction. This fouled up all future cache activities.
I have posted patches to fix both issues to the sssd-devel list: https://fedorahosted.org/pipermail/sssd-devel/2011-March/005902.html
Those seem to have done it. First I just applied just the patch for condition (2) (not cancelling the in-progress transaction) to make sure it works. It did, as verified by
$ getent group annon4088 $
the log files showing that the transaction was being backed out as it should
... (Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (1): Error checking cache for member #150 (): (Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [ldb] (9): cancel ldb transaction (nesting: 1) (Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [ldb] (9): cancel ldb transaction (nesting: 0) (Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [sdap_id_op_done] (9): releasing operation connection (Mon Mar 28 10:53:38 2011) [sssd[be[LDAP]]] [acctinfo_callback] (4): Request processed. Returned 3,12,Out of memory
and further requests working as they should
I then applied patch for condition (1) (ignoring zero-length memberuid fields) as well. It also worked, as verified by
$ getent group annon4088 ... a lot of output ... $
and the log files showed that the empty memberUid entry 150 was just skipped over
... (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): member #149 (annon2726): not found in sysdb (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sdap_process_missing_member_2307] (7): Adding a dummy entry (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 2) (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x27d3b10 (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x27d3c30 (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x27d3c30 "ltdb_timeout" (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x27d3b10 "ltdb_callback" (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): commit ldb transaction (nesting: 2) (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x27b19d0 (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x27d2c00 (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x27d2c00 "ltdb_timeout" (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x27b19d0 "ltdb_callback" (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sysdb_search_user_by_name] (6): Error: 2 (No such file or directory) (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sdap_process_group_members_2307] (7): member #151 (annon2756): not found in sysdb (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sdap_process_missing_member_2307] (7): Adding a dummy entry (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 2) (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x27d3e90 (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x27d3070 (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x27d3070 "ltdb_timeout" (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x27d3e90 "ltdb_callback" (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): commit ldb transaction (nesting: 2) (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x27d3c00 (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x27d3a70 (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x27d3a70 "ltdb_timeout" (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x27d3c00 "ltdb_callback" (Mon Mar 28 11:21:41 2011) [sssd[be[LDAP]]] [sysdb_search_user_by_name] (6): Error: 2 (No such file or directory) ...
I'm please to say it looks like this bug report can be closed. Thanks very much for all the hard work.
Fixed in:
- 97bd239c546febf20008e4e2d8a959b91c738d1f - 95c3131ffeef889399f5afada8b8e47e7892d81f - 1a881a1b5997494ef3228cf35ae0f2c7381d14ff
- 6d4430c2a6ae34012dabb95f82b32f7e8068700b - 50968b666d1b2ca6ee4825b590bfd84ab1e32e55 - 2e59b72c0d8de3183996063c5d39be6b66c446f5
resolution: => fixed status: assigned => closed tests: 0 => 1
rhbz: => 0
Metadata Update from @twhitehead: - Issue assigned to sgallagh - Issue set to the milestone: SSSD 1.5.5
SSSD is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in SSSD's github repository.
This issue has been cloned to Github and is available here: - https://github.com/SSSD/sssd/issues/1869
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Login to comment on this ticket.