#47966 slapd crashes during Dogtag clone reinstallation
Closed: wontfix. Opened 6 years ago by edewata.

When a Dogtag clone is uninstalled and then reinstalled against the same slapd instance, slapd crashes consistently. Reinstalling a non-replicated Dogtag does not crash slapd, so this is probably related to replication.

Steps to reproduce (see http://pki.fedoraproject.org/wiki/Quick_Start):
1. On master, create DS instance and CA instance.
2. On replica, create DS instance and CA clone.
3. On replica, remove CA clone.
4. On replica, create CA clone again.

During step #4 slapd crashes and the CA clone installation gets stuck waiting for slapd.

CA clone installation output:

{{{
...
pkispawn    : INFO     ....... executing 'systemctl start pki-tomcatd@pki-tomcat.service'
pkispawn    : INFO     ....... constructing PKI configuration data.
pkispawn    : INFO     ....... configuring PKI configuration data.
}}}

CA clone debug log:

{{{
...
[26/Nov/2014:01:17:15][http-bio-8443-exec-3]: Still checking wait_dn 'cn=index1160589769, cn=index, cn=tasks, cn=config' (netscape.ldap.LDAPException: failed to connect to server ldap://vm-064.idm.lab.bos.redhat.com:389 (91))
}}}

slapd error log:

{{{
...
[26/Nov/2014:01:26:54 -0500] - ldbm: Bringing ca offline...
[26/Nov/2014:01:26:54 -0500] - ldbm: removing 'ca'.
[26/Nov/2014:01:26:54 -0500] - Destructor for instance ca called
[26/Nov/2014:01:26:55 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=ca,dc=example,dc=com is going offline; disabling replication
[26/Nov/2014:01:26:55 -0500] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-vm-064.idm.lab.bos.redhat.com-pki-tomcat" (vm-084:389): The remote replica has a different database generation ID than the local database.  You may have to reinitialize the remote replica, or the local replica.
[26/Nov/2014:01:26:55 -0500] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[26/Nov/2014:01:26:58 -0500] - import ca: Workers finished; cleaning up...
[26/Nov/2014:01:26:58 -0500] - import ca: Workers cleaned up.
[26/Nov/2014:01:26:58 -0500] - import ca: Indexing complete.  Post-processing...
[26/Nov/2014:01:26:58 -0500] - import ca: Generating numSubordinates complete.
[26/Nov/2014:01:26:58 -0500] - import ca: Flushing caches...
[26/Nov/2014:01:26:58 -0500] - import ca: Closing files...
[26/Nov/2014:01:26:58 -0500] - import ca: Import complete.  Processed 82 entries in 3 seconds. (27.33 entries/sec)
[26/Nov/2014:01:26:58 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=ca,dc=example,dc=com is coming online; enabling replication
}}}

A core file is available, but it's rather large (~400 MB). Is there any specific info you need?
{{{
[root@vm-064 scripts]# rpm -q 389-ds-base
389-ds-base-1.3.2.24-1.fc20.x86_64
}}}

I attached the stack trace and the access log buffer from the core file. Please let me know if you need anything else.

Thanks for the stacktraces, Endi.

I'm curious why VLV is called from dse_add... Could you attach your dse.ldif?

And in thread 1, could you print e (the entry being added)?

Hi Endi, could you share your DS config info with us? Thanks!

Hi, sorry for the delay. My VM was being used for other work. Here is the gdb output:
{{{
(gdb) bt
#0  0x00007fbf2c552b4e in vlvSearch_init (p=0x7fbf20001740, pb=pb@entry=0x7fbf057e9ae0, e=e@entry=0x7fbf20012ed0, inst=inst@entry=0x7fbef805ba60) at ldap/servers/slapd/back-ldbm/vlv_srch.c:189
#1  0x00007fbf2c54e946 in vlv_AddSearchEntry (pb=0x7fbf057e9ae0, entryBefore=0x7fbf20012ed0, entryAfter=<optimized out>, returncode=<optimized out>, returntext=<optimized out>, arg=0x7fbef805ba60) at ldap/servers/slapd/back-ldbm/vlv.c:83
#2  0x00007fbf367984eb in dse_call_callback (pb=pb@entry=0x7fbf057e9ae0, operation=operation@entry=16, flags=flags@entry=1, entryBefore=0x7fbf20012ed0, entryAfter=entryAfter@entry=0x0, returncode=returncode@entry=0x7fbf057e5534, returntext=returntext@entry=0x7fbf057e5590 "", pdse=<optimized out>) at ldap/servers/slapd/dse.c:2421
#3  0x00007fbf3679afbf in dse_add (pb=0x7fbf057e9ae0) at ldap/servers/slapd/dse.c:2171
#4  0x00007fbf36784c4a in op_shared_add (pb=pb@entry=0x7fbf057e9ae0) at ldap/servers/slapd/add.c:740
#5  0x00007fbf36785f90 in do_add (pb=pb@entry=0x7fbf057e9ae0) at ldap/servers/slapd/add.c:258
#6  0x00007fbf36ca1dc4 in connection_dispatch_operation (pb=0x7fbf057e9ae0, op=0x7fbf38a4e4b0, conn=0x7fbf36b6c3d0) at ldap/servers/slapd/connection.c:645
#7  connection_threadmain () at ldap/servers/slapd/connection.c:2534
#8  0x00007fbf34bcde3b in _pt_root (arg=0x7fbf38a4b9f0) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#9  0x00007fbf3456df33 in start_thread (arg=0x7fbf057ea700) at pthread_create.c:309
#10 0x00007fbf3429bead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) print e
$1 = (struct backentry *) 0x0
}}}

Hi Endi,

I could start ns-slapd with your dse.ldif. Could you tell me what operations were made against the server? Please note that the access log is not written immediately; the entries are buffered, so this ADD may not actually be the last operation... If possible, it might be easier for both of us if you let me access your VM and examine the core there. If you agree, could you please send me the info via email? Thanks.

{{{
1266 [06/Dec/2014:00:04:20 -0500] conn=83 op=18 SRCH base="ou=certificateRepository,ou=ca,dc=ca,dc=example,dc=com" scope=0 filter="(|(objectClass=*)(objectClass=ldapsubentry))" attrs=ALL
1267 [06/Dec/2014:00:04:20 -0500] conn=83 op=18 RESULT err=0 tag=101 nentries=1 etime=0
1268 [06/Dec/2014:00:04:20 -0500] conn=83 op=19 MOD dn="ou=certificateRepository,ou=ca,dc=ca,dc=example,dc=com"
1269 [06/Dec/2014:00:04:20 -0500] conn=83 op=19 RESULT err=0 tag=103 nentries=0 etime=0 csn=54828e56000000560000
1270 [06/Dec/2014:00:04:20 -0500] conn=83 op=20 ADD dn="cn=30000001,ou=certificateRepository,ou=ranges,dc=ca,dc=example,dc=com"
1271 [06/Dec/2014:00:04:20 -0500] conn=83 op=20 RESULT err=0 tag=105 nentries=0 etime=0 csn=54828e56000100560000
}}}

I think I recall something like this. If I recall correctly, the workaround was to restart the DS after the clone install (i.e. after step 3). This is what is done in the IPA installation scripts.

When we set up the replica, we do the following:
1. internaldb.ldif (set up database, suffix, etc.)
2. index.ldif (add indexes)
3. set up replication agreements and initialize the consumer; wait for replication to complete. This replicates over the data and ACLs.
4. manager.ldif (set up a DB user to be used for connections to the DB)
5. post_ldif (set up VLV indexes and VLV tasks)
6. wait_dn --> wait for the VLV tasks to be completed.

Noriko,

I emailed you the access to the VM. You can find the LDIF files imported during clone installation in /etc/pki/pki-clone/ca/.

Looks good, but you say that there is no chance to unregister the vlv callbacks. So vlv_init is called multiple times, but vlv_close is not? If vlv_close() is called, then the unregistering of the callbacks should happen there.

Replying to [comment:16 mreynolds]:

Looks good, but you say that there is no chance to unregister the vlv callbacks. So vlv_init is called multiple times, but vlv_close is not? If vlv_close() is called, then the unregistering of the callbacks should happen there.

Sharp eyes, Mark! My comment was wrong... :p (I fixed that part before the holidays; I should have updated the comment then...) I'm replacing the patch with this updated comment.
{{{
2) VLV plugin callbacks are registered in vlv_init. The function could be
called multiple times without unregistering the VLV plugin callbacks, e.g.,
from a bulk import, which reopens the backend instance. If you create an
instance and a VLV index, then initialize the database with a bulk import,
2 sets of VLV index callbacks are registered. The first set points to the
old instance address, which has already been freed.
Note: the unregister callback functions are called only when the instance
is deleted, in vlv_remove_callbacks via ldbm_instance_unregister_callbacks.
The callback is set as the post-delete plugin with "(objectclass=nsBackendInstance)".
}}}

Replying to [comment:17 nhosoi]:

Replying to [comment:16 mreynolds]:

Looks good, but you say that there is no chance to unregister the vlv callbacks. So vlv_init is called multiple times, but vlv_close is not? If vlv_close() is called, then the unregistering of the callbacks should happen there.

Sharp eyes, Mark! My comment was wrong... :p (I fixed that part before the holidays; I should have updated the comment then...) I'm replacing the patch with this updated comment.

Thanks for the clarification!

Reviewed by Rich and Mark (Thank you!!)

Pushed to master:
{{{
987580d..ebf24ff master -> master
commit ebf24ff
}}}

Workaround:
- Restart the server when a backend with VLV is removed.
- Restart the server once a bulk import has been done against a backend with VLV.

Pushed the CI test to master:
{{{
e7c8da6..8309fd0 master -> master
commit 8309fd0
}}}

Backported the fix and CI test to 1.2.11:
{{{
commit 7930571 --> CI test
commit 860be43 --> Fix
}}}

{{{
6c39c95..a4a3453 389-ds-base-1.3.5 -> 389-ds-base-1.3.5
commit a4a3453
Author: Noriko Hosoi <nhosoi@redhat.com>
Date: Fri Apr 10 14:10:40 2015 -0700

85f4cd6..9f9dcdd 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 9f9dcdd
}}}

Metadata Update from @edewata (3 years ago):
- Issue assigned to nhosoi
- Issue set to the milestone: 1.3.4 backlog

389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in 389-ds-base's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/389ds/389-ds-base/issues/1297

If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi (4 months ago):
- Issue close_status updated to: wontfix (was: Fixed)

