When a Dogtag clone is uninstalled and reinstalled on the same slapd instance, slapd crashes consistently. Reinstalling a non-replicated Dogtag instance does not crash slapd, so this is probably related to replication.
Steps to reproduce (see http://pki.fedoraproject.org/wiki/Quick_Start):
1. On the master, create a DS instance and a CA instance.
2. On the replica, create a DS instance and a CA clone.
3. On the replica, remove the CA clone.
4. On the replica, create the CA clone again.
During step 4, slapd crashes and the CA clone installation gets stuck waiting for slapd.
CA clone installation output:
{{{
...
pkispawn    : INFO     ....... executing 'systemctl start pki-tomcatd@pki-tomcat.service'
pkispawn    : INFO     ....... constructing PKI configuration data.
pkispawn    : INFO     ....... configuring PKI configuration data.
}}}
CA clone debug log:
{{{
...
[26/Nov/2014:01:17:15][http-bio-8443-exec-3]: Still checking wait_dn 'cn=index1160589769, cn=index, cn=tasks, cn=config' (netscape.ldap.LDAPException: failed to connect to server ldap://vm-064.idm.lab.bos.redhat.com:389 (91))
}}}
slapd error log:
{{{
...
[26/Nov/2014:01:26:54 -0500] - ldbm: Bringing ca offline...
[26/Nov/2014:01:26:54 -0500] - ldbm: removing 'ca'.
[26/Nov/2014:01:26:54 -0500] - Destructor for instance ca called
[26/Nov/2014:01:26:55 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=ca,dc=example,dc=com is going offline; disabling replication
[26/Nov/2014:01:26:55 -0500] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-vm-064.idm.lab.bos.redhat.com-pki-tomcat" (vm-084:389): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[26/Nov/2014:01:26:55 -0500] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[26/Nov/2014:01:26:58 -0500] - import ca: Workers finished; cleaning up...
[26/Nov/2014:01:26:58 -0500] - import ca: Workers cleaned up.
[26/Nov/2014:01:26:58 -0500] - import ca: Indexing complete. Post-processing...
[26/Nov/2014:01:26:58 -0500] - import ca: Generating numSubordinates complete.
[26/Nov/2014:01:26:58 -0500] - import ca: Flushing caches...
[26/Nov/2014:01:26:58 -0500] - import ca: Closing files...
[26/Nov/2014:01:26:58 -0500] - import ca: Import complete. Processed 82 entries in 3 seconds. (27.33 entries/sec)
[26/Nov/2014:01:26:58 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=ca,dc=example,dc=com is coming online; enabling replication
}}}
Is a core file available? http://www.port389.org/docs/389ds/FAQ/faq.html#debug_crashes
rpm -q 389-ds-base?
Thanks.
Core is available, but it's rather large (~400MB). Is there specific info you need?
{{{
[root@vm-064 scripts]# rpm -q 389-ds-base
389-ds-base-1.3.2.24-1.fc20.x86_64
}}}
stacktrace.txt
access.buf
I attached the stack trace and the access log buffer from the core file. Please let me know if you need anything else.
Thanks for the stacktraces, Endi.
I'm curious why VLV is called from dse_add... Could you attach your dse.ldif?
Also, in thread 1, could you print e (the entry being added)?
Hi Endi, could you share your DS config info with us? Thanks!
dse.ldif
Hi Endi,
I was able to start ns-slapd with your dse.ldif. Could you tell me what operations were run against the server? Please note that the access log is buffered rather than written immediately, so this ADD may not be the last operation... If possible, it might be easier for both of us if you let me access your VM and examine the core there. If you agree, could you please send me the details via email? Thanks.
{{{
1266 [06/Dec/2014:00:04:20 -0500] conn=83 op=18 SRCH base="ou=certificateRepository,ou=ca,dc=ca,dc=example,dc=com" scope=0 filter="(|(objectClass=*)(objectClass=ldapsubentry))" attrs=ALL
1267 [06/Dec/2014:00:04:20 -0500] conn=83 op=18 RESULT err=0 tag=101 nentries=1 etime=0
1268 [06/Dec/2014:00:04:20 -0500] conn=83 op=19 MOD dn="ou=certificateRepository,ou=ca,dc=ca,dc=example,dc=com"
1269 [06/Dec/2014:00:04:20 -0500] conn=83 op=19 RESULT err=0 tag=103 nentries=0 etime=0 csn=54828e56000000560000
1270 [06/Dec/2014:00:04:20 -0500] conn=83 op=20 ADD dn="cn=30000001,ou=certificateRepository,ou=ranges,dc=ca,dc=example,dc=com"
1271 [06/Dec/2014:00:04:20 -0500] conn=83 op=20 RESULT err=0 tag=105 nentries=0 etime=0 csn=54828e56000100560000
}}}
I think I recall something like this. If I recall correctly, the workaround was to restart the DS after the clone install (i.e., after step 3). This is what is done in the IPA installation scripts.
When we set up the replica, we do the following:
1. internaldb.ldif (set up the database, suffix, etc.)
2. index.ldif (add indexes)
3. Set up replication agreements and initialize the consumer. Wait for replication to complete. This replicates over the data and ACLs.
4. manager.ldif (set up a DB user to be used for connections to the DB)
5. post_ldif (set up VLV indexes and VLV tasks)
6. wait_dn --> wait for the VLV tasks to be completed (see the sketch below)
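For illustration, here is a minimal standalone sketch of what step 6 (wait_dn) amounts to: poll the VLV index task DN under cn=index,cn=tasks,cn=config until the search succeeds. This is not the Dogtag code (Dogtag uses the Java netscape.ldap API, as the debug log above shows); it is an OpenLDAP libldap sketch, and the server URI, bind DN, password, and task DN are assumptions taken from the logs in this ticket. Once slapd has crashed, the search keeps failing with a connect error, which is why the installer loops on "Still checking wait_dn ...".
{{{
/* wait_dn_sketch.c -- illustrative only, NOT the Dogtag implementation.
 * Polls a task entry under cn=tasks,cn=config until the search succeeds.
 * URI, bind DN, password and task DN are placeholders/assumptions.
 * Build roughly with: cc wait_dn_sketch.c -lldap -llber
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <ldap.h>

int main(void)
{
    const char *uri     = "ldap://vm-064.idm.lab.bos.redhat.com:389";       /* assumption */
    const char *binddn  = "cn=Directory Manager";                           /* assumption */
    char        pw[]    = "password";                                       /* placeholder */
    const char *wait_dn = "cn=index1160589769,cn=index,cn=tasks,cn=config"; /* from the debug log */
    int         version = LDAP_VERSION3;
    struct berval cred  = { strlen(pw), pw };
    LDAP *ld = NULL;
    int rc;

    if (ldap_initialize(&ld, uri) != LDAP_SUCCESS) {
        fprintf(stderr, "ldap_initialize failed\n");
        return 1;
    }
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    rc = ldap_sasl_bind_s(ld, binddn, LDAP_SASL_SIMPLE, &cred, NULL, NULL, NULL);
    if (rc != LDAP_SUCCESS) {
        fprintf(stderr, "bind: %s\n", ldap_err2string(rc));
        return 1;
    }

    for (;;) {
        LDAPMessage *res = NULL;
        /* base-scope search for the task entry itself */
        rc = ldap_search_ext_s(ld, wait_dn, LDAP_SCOPE_BASE, "(objectclass=*)",
                               NULL, 0, NULL, NULL, NULL, 0, &res);
        if (res != NULL) {
            ldap_msgfree(res);
        }
        if (rc == LDAP_SUCCESS) {
            printf("wait_dn '%s' is reachable\n", wait_dn);
            break;
        }
        /* After slapd crashes this keeps reporting a connect error (error 91
         * in the Java SDK log above), so the loop never terminates.  A real
         * implementation would also re-establish the connection here. */
        fprintf(stderr, "Still checking wait_dn '%s' (%s)\n",
                wait_dn, ldap_err2string(rc));
        sleep(5);
    }

    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}
}}}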
Noriko,
I emailed you the access to the VM. You can find the LDIF files imported during clone installation in /etc/pki/pki-clone/ca/.
git patch file (master) 0001-Ticket-47966-slapd-crashes-during-Dogtag-clone-reins.patch
Looks good, but you say that there is no chance to unregister the vlv callbacks. So vlv_init is called multiple times, but vlv_close is not? If vlv_close() is called, then the unregistering of the callbacks should happen there.
Replying to [comment:16 mreynolds]:
Sharp eyes, Mark! My comment was wrong... :p (I fixed that part before the holidays. I should have updated the comment then...) I'm replacing the patch with this updated comment.
{{{
2) VLV plugin callbacks are registered in vlv_init. The function can be called
multiple times without unregistering the VLV plugin callbacks, e.g., from bulk
import, which reopens the backend instance. If an instance and a VLV index are
created, then the database is initialized with bulk import, 2 VLV index
callbacks end up registered. The first set points to the old instance address,
which has already been freed.
Note: the unregister callback functions are called only when the instance is
deleted, in vlv_remove_callbacks via ldbm_instance_unregister_callbacks. The
callback is set as the post-delete plugin with "(objectclass=nsBackendInstance)".
}}}
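To make that failure mode concrete, here is a small self-contained C sketch. It is not 389-ds-base source: every name in it (be_instance, cb_entry, register_cb, vlv_like_callback) is invented for illustration. It mimics a per-instance callback being registered a second time after the instance has been torn down and recreated (as a bulk import does), while the first registration, which would only be removed when the instance entry is deleted, keeps pointing at freed memory.
{{{
/* stale_callback_sketch.c -- illustrative only, NOT 389-ds-base code.
 * Shows why a second vlv_init()-style registration without unregistering
 * the first one leaves a callback bound to a freed instance.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct be_instance {      /* stand-in for the ldbm backend instance */
    const char *name;
} be_instance;

typedef void (*cb_fn)(be_instance *inst);

typedef struct cb_entry {         /* stand-in for a registered DSE callback */
    cb_fn        fn;
    be_instance *inst;            /* instance captured at registration time */
} cb_entry;

static cb_entry callbacks[8];
static int      n_callbacks;

static void vlv_like_callback(be_instance *inst)
{
    /* in slapd this would touch per-instance VLV state */
    printf("callback invoked for instance '%s'\n", inst->name);
}

static void register_cb(cb_fn fn, be_instance *inst)   /* ~ registration done in vlv_init */
{
    callbacks[n_callbacks].fn   = fn;
    callbacks[n_callbacks].inst = inst;
    n_callbacks++;
}

int main(void)
{
    /* instance "ca" is created and its callback registered once */
    be_instance *ca = malloc(sizeof(*ca));
    ca->name = "ca";
    register_cb(vlv_like_callback, ca);

    /* a bulk import (or clone reinstall) tears the instance down and
     * recreates it; the old registration is never removed */
    void *old_addr = (void *)ca;      /* remembered only for the printout */
    free(ca);
    ca = malloc(sizeof(*ca));
    ca->name = "ca (reopened)";
    register_cb(vlv_like_callback, ca);

    printf("%d callbacks registered; slot 0 still holds freed instance %p\n",
           n_callbacks, old_addr);

    /* In slapd, the next operation that fires these callbacks (e.g. the
     * dse_add of a VLV index task) would call slot 0 with that freed
     * pointer and crash; here we only invoke the valid registration. */
    callbacks[1].fn(callbacks[1].inst);

    free(ca);
    return 0;
}
}}}
The attached patch addresses this in the server's VLV code; the sketch above only shows why invoking the stale first registration is fatal.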
git patch file (master) -- comment is revised. 0001-Ticket-47966-slapd-crashes-during-Dogtag-clone-reins.2.patch
Replying to [comment:17 nhosoi]:
Replying to [comment:16 mreynolds]:
Looks good, but you say that there is no chance to unregister the vlv callbacks. So vlv_init is called multiple times, but vlv_close is not? If vlv_close() is called, then the unregistering of the callbacks should happen there.
Sharp eyes, Mark! My comment was wrong... :p (I fixed that part before the holidays. I should have updated the comment then...) I'm replacing the patch with this updated comment.
Thanks for the clarification!
Reviewed by Rich and Mark (Thank you!!)
Pushed to master: 987580d..ebf24ff master -> master commit ebf24ff
Workaround: restart the server when a backend with VLV indexes is removed, and restart the server once a bulk import has been done against a backend with VLV indexes.
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1210845
git patch file (master) -- CI test 0001-Ticket-47966-CI-test-added-test-cases-for-ticket-479.patch
Pushed the CI test to master: e7c8da6..8309fd0 master -> master commit 8309fd0
backported fix/CI test to 1.2.11
commit 7930571 --> CI test
commit 860be43 --> Fix
6c39c95..a4a3453  389-ds-base-1.3.5 -> 389-ds-base-1.3.5
commit a4a3453
Author: Noriko Hosoi <nhosoi@redhat.com>
Date:   Fri Apr 10 14:10:40 2015 -0700
85f4cd6..9f9dcdd  389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 9f9dcdd
Metadata Update from @edewata:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.3.4 backlog
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1297
If you want to receive further updates on the issue, please navigate to the github issue and click on the subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)