#47871 389-ds-base-1.3.2.21-1.fc20 crashed over the weekend
Closed: Fixed None Opened 5 years ago by mkosek.

On Aug 8th I run installed FreeIPA server with 389-ds-base-1.3.2.21-1.fc20 which finished OK. The instance was left running when it suddenly crashed over the weekend:

# systemctl status dirsrv@MKOSEK-FEDORA20-TEST.service 
dirsrv@MKOSEK-FEDORA20-TEST.service - 389 Directory Server MKOSEK-FEDORA20-TEST.
   Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled)
   Active: failed (Result: signal) since Sun 2014-08-10 11:43:40 CEST; 1 day 21h ago
 Main PID: 7656 (code=killed, signal=SEGV)

Aug 10 11:13:11 ipa.mkosek-fedora20.test ns-slapd[7656]: GSSAPI server step 2
Aug 10 11:13:11 ipa.mkosek-fedora20.test ns-slapd[7656]: GSSAPI server step 3
Aug 10 11:28:11 ipa.mkosek-fedora20.test ns-slapd[7656]: GSSAPI server step 1
Aug 10 11:28:12 ipa.mkosek-fedora20.test ns-slapd[7656]: GSSAPI server step 2
Aug 10 11:28:12 ipa.mkosek-fedora20.test ns-slapd[7656]: GSSAPI server step 3
Aug 10 11:43:12 ipa.mkosek-fedora20.test ns-slapd[7656]: GSSAPI server step 1
Aug 10 11:43:13 ipa.mkosek-fedora20.test ns-slapd[7656]: GSSAPI server step 2
Aug 10 11:43:13 ipa.mkosek-fedora20.test ns-slapd[7656]: GSSAPI server step 3
Aug 10 11:43:40 ipa.mkosek-fedora20.test systemd[1]: dirsrv@MKOSEK-FEDORA20-TEST.service: main pro...EGV
Aug 10 11:43:40 ipa.mkosek-fedora20.test systemd[1]: Unit dirsrv@MKOSEK-FEDORA20-TEST.service ente...te.
Hint: Some lines were ellipsized, use -l to show in full.

Stacktrace attached.


it seems the crash is happening because getting the entry for a changenumber returns a corrputed entry, changelog trimming doesn't check for errors in that place.
Is the error log still available, does the server start and could you perform a search on cn=changelog ?

No, in the error log, I only see this last entry in /var/log/dirsrv/slapd-MKOSEK-FEDORA20-TEST/errors.20140804-102138:
{{{
[08/Aug/2014:11:53:02 +0200] ipapwd_pre_bind - [file prepost.c, line 1437]: kerberos principal in uid=tuser1,cn=users,cn=accounts,dc=mkosek-fedora20,dc=test is expired
[08/Aug/2014:11:53:13 +0200] - Entry "uid=tuser1,cn=users,cn=accounts,dc=mkosek-fedora20,dc=test" -- attribute "random" not allowed
}}}

and then just starting the new DS in /var/log/dirsrv/slapd-MKOSEK-FEDORA20-TEST/errors, though with a database error:
{{{
389-Directory/1.3.2.21 B2014.219.1850
ipa.mkosek-fedora20.test:636 (/etc/dirsrv/slapd-MKOSEK-FEDORA20-TEST)

[12/Aug/2014:10:41:36 +0200] - 389-Directory/1.3.2.21 B2014.219.1850 starting up
[12/Aug/2014:10:41:36 +0200] - WARNING: changelog: entry cache size 2097152B is less than db size 9732096B; We recommend to increase the entry cache size nsslapd-cachememsize.
[12/Aug/2014:10:41:36 +0200] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
[12/Aug/2014:10:41:38 +0200] schema-compat-plugin - warning: no entries set up under cn=computers, cn=compat,dc=mkosek-fedora20,dc=test
[12/Aug/2014:10:41:38 +0200] schema-compat-plugin - warning: no entries set up under cn=ng, cn=compat,dc=mkosek-fedora20,dc=test
[12/Aug/2014:10:41:38 +0200] schema-compat-plugin - warning: no entries set up under ou=sudoers,dc=mkosek-fedora20,dc=test
[12/Aug/2014:10:41:38 +0200] NSACLPlugin - The ACL target ou=sudoers,dc=mkosek-fedora20,dc=test does not exist
[12/Aug/2014:10:41:38 +0200] NSACLPlugin - The ACL target cn=ad,cn=etc,dc=mkosek-fedora20,dc=test does not exist
[12/Aug/2014:10:41:38 +0200] NSACLPlugin - The ACL target cn=casigningcert cert-pki-ca,cn=ca_renewal,cn=ipa,cn=etc,dc=mkosek-fedora20,dc=test does not exist
[12/Aug/2014:10:41:38 +0200] NSACLPlugin - The ACL target cn=casigningcert cert-pki-ca,cn=ca_renewal,cn=ipa,cn=etc,dc=mkosek-fedora20,dc=test does not exist
[12/Aug/2014:10:41:38 +0200] NSACLPlugin - The ACL target cn=automember rebuild membership,cn=tasks,cn=config does not exist
[12/Aug/2014:10:41:38 +0200] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=mkosek-fedora20,dc=test--no CoS Templates found, which should be added before the CoS Definition.
[12/Aug/2014:10:41:38 +0200] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=mkosek-fedora20,dc=test--no CoS Templates found, which should be added before the CoS Definition.
[12/Aug/2014:10:41:38 +0200] - slapd started. Listening on All Interfaces port 389 for LDAP requests
[12/Aug/2014:10:41:38 +0200] - Listening on All Interfaces port 636 for LDAPS requests
[12/Aug/2014:10:41:38 +0200] - Listening on /var/run/slapd-MKOSEK-FEDORA20-TEST.socket for LDAPI requests
[12/Aug/2014:10:42:09 +0200] - database error -30993
[12/Aug/2014:10:42:09 +0200] - database error -30993
}}}

DS did not crash during ldapsearch:
{{{

ldapsearch -D "cn=Directory Manager" -x -w "kokos123" -b cn=changelog

...

7837, changelog

dn: changenumber=7837,cn=changelog
objectClass: top
objectClass: changelogentry
objectClass: extensibleObject
targetuniqueid: 72c4d509-1edf11e4-9b00fe62-30505311
changeNumber: 7837
targetDn: fqdn=ipa.mkosek-fedora20.test,cn=computers,cn=accounts,dc=mkosek-fed
ora20,dc=test
changeTime: 20140812084259Z
changeType: modify
changes:: cmVwbGFjZToga3JiTGFzdFN1Y2Nlc3NmdWxBdXRoCmtyYkxhc3RTdWNjZXNzZnVsQXV0
aDogMjAxNDA4MTIwODQzMDBaCi0KcmVwbGFjZTogbW9kaWZpZXJzbmFtZQptb2RpZmllcnNuYW1lO
iBjbj1JUEEgTG9ja291dCxjbj1wbHVnaW5zLGNuPWNvbmZpZwotCnJlcGxhY2U6IG1vZGlmeXRpbW
VzdGFtcAptb2RpZnl0aW1lc3RhbXA6IDIwMTQwODEyMDg0MjU5WgotCnJlcGxhY2U6IGVudHJ5dXN
uCmVudHJ5dXNuOiAyMjI4MAotCgA=

search result

search: 2
result: 0 Success

numResponses: 31

numEntries: 30

}}}

I wonder if there is not a bug in handle_getchangerecord_search. This callback function stores the entry return by the search into a temporary structure (crt->crt_entry). When returning from the internal search, get_changerecord returns that entry.

According to https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/8.1/html/Plug-in_Guide/Plugin_Programming_Guide-Function_Reference-slapi_search_internal_callback_pb.html, if we want to access the entry outside of the callback we need to do slapi_entry_dup.

I set up 3 freeipa instances on F20, two with the fix, one without.

Ran freeipa unit tests, then let them idle (this was the conditions when the crashed occurred). To increase the probability of crash, it trims the retro changelog each 30s and it evaluates ~ 5000 entries at each trimming.

{{{
dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
nsslapd-changelogmaxage: 8d
nsslapd-changelog-trim-interval: 30s

}}}

The tests started Friday 22nd. No crash after 3 days.

Push to '''master'''

git merge ticket47871
Updating 1c888bf..495151b
Fast-forward
dirsrvtests/tickets/ticket47871_test.py | 306 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ldap/servers/plugins/retrocl/retrocl.h | 2 +-
ldap/servers/plugins/retrocl/retrocl_trim.c | 163 +++++++++++++++++++++++--------------------
3 files changed, 393 insertions(+), 78 deletions(-)
create mode 100644 dirsrvtests/tickets/ticket47871_test.py

git push origin master
Counting objects: 20, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (11/11), 5.29 KiB, done.
Total 11 (delta 7), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
1c888bf..495151b master -> master

commit 495151b
Author: Thierry bordaz (tbordaz) tbordaz@redhat.com
Date: Thu Aug 21 11:49:18 2014 +0200

Push to '''1.3.2'''

git cherry-pick 495151b
[389-ds-base-1.3.2 47fbe25] Ticket 47871 - 389-ds-base-1.3.2.21-1.fc20 crashed over the weekend
3 files changed, 393 insertions(+), 78 deletions(-)
create mode 100644 dirsrvtests/tickets/ticket47871_test.py

git push origin 389-ds-base-1.3.2
Counting objects: 20, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (11/11), 5.17 KiB, done.
Total 11 (delta 7), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
f237812..47fbe25 389-ds-base-1.3.2 -> 389-ds-base-1.3.2

commit 47fbe25
Author: Thierry bordaz (tbordaz) tbordaz@redhat.com
Date: Thu Aug 21 11:49:18 2014 +0200

adding the crashing thread backstack to help to retrieve that bug {{{ Thread 1 (Thread 0x7f83a2cb2700 (LWP 30103)): #0 attrlist_find (a=0x105, type=0x7f8395818716 "changetime") at ldap/servers/slapd/attrlist.c:144 #1 0x00007f83a286bec2 in slapi_entry_attr_find (e=0x105, type=0x7f8395818716 "changetime", a=0x7f83a2cb1d58) at ldap/servers/slapd/entry.c:2362 #2 0x00007f83958176b0 in trim_changelog () at ldap/servers/plugins/retrocl/retrocl_trim.c:287 #3 changelog_trim_thread_fn (arg=<optimized out="">) at ldap/servers/plugins/retrocl/retrocl_trim.c:348 #4 0x00007f83a0c9be5b in _pt_root (arg=0x7f837c164fb0) at ../../../nspr/pr/src/pthreads/ptthread.c:212 #5 0x00007f83a063bf33 in start_thread (arg=0x7f83a2cb2700) at pthread_create.c:309 #6 0x00007f83a0369ded in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 }}}

Metadata Update from @nhosoi:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.3.2.23

2 years ago

Login to comment on this ticket.

Metadata