#49319 LDAP modify hangs during PKI CA cloning
Closed: wontfix 6 years ago Opened 6 years ago by edewata.

The latest DS on F25 (389-ds-base-1.3.5.17-3.fc25.x86_64) hangs on an LDAP modify operation during PKI CA cloning. Downgrading DS to an older version (389-ds-base-1.3.5.14-1.fc25.x86_64) fixes the problem. The details are documented in PKI Ticket 2780 (https://pagure.io/dogtagpki/issue/2780).

Steps to reproduce:
1. Install CA (http://pki.fedoraproject.org/wiki/Installing_CA)
2. Install CA clone (http://pki.fedoraproject.org/wiki/Installing_CA_Clone)

Actual result: The DS hangs on an LDAP modify operation, causing PKI CA cloning to hang as well.

Expected result: The DS should not hang so the PKI CA cloning should complete successfully.


Hey mate,

This is not enough information to help us diagnose the issue. It would be great if we could see your access log, and dse.ldif to understand what was happening in the server at the time. As well, reach out to @ftweedal who works on IPA-Dogtag, as he has some pretty solid LDAP knowledge and would be able to help us create a reproducer,

Thanks,

Metadata Update from @firstyear:
- Custom field type adjusted to defect

6 years ago

trying to reproduce with the following test case, I failed to create a hang

Master (389-ds-base-1.3.5.17-3.fc25.x86_64)
ipa-server-install --hostname=<master.fqdn> -p Secret123 -a Secret123 --domain test.com --realm TEST.COM -U

Replica (389-ds-base-1.3.5.17-3.fc25.x86_64)
ipa-client-install --domain=test.com --server=<master.fqdn> --realm=TEST.COM -p admin -w Secret123
ipa-replica-install --setup-ca

Hang being systematic, did you identify a test case (with pki commands ? )

I also tried with IPA on F25 and could not reproduce the problem either. However, I was able to use the above PKI CA setup instructions to reproduce the problem consistently on 3 different F25 machines. There's no test case to run since the step #2 above never finishes. As I mentioned before, downgrading the DS fixes the problem.

Also, I was not able to reproduce the problem with the same PKI CA setup instruction on F26. It has a newer DS version.

It would be difficult to create a small reproducer program since the PKI clone setup code is quite complex and does many modifications to the DS before actually executing the LDAP modify operation that hangs.

I have the machine that is still waiting for the PKI CA setup to complete. I'll email the information.

This hang is due to a lock leak that is fixed in https://pagure.io/389-ds-base/issue/49209

The hang symptom is many threads waiting on replica lock while no other thread is holding that lock

Thread 42 (Thread 0x7f32577de700 (LWP 64584)):
#0  0x00007f331d4ed460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f331db44fa3 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f3311982865 in replica_get_ruv () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3  0x00007f331197fd80 in acquire_replica () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#4  0x00007f331197882d in repl5_inc_run () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#5  0x00007f331197f191 in prot_thread_main () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#6  0x00007f331db4a5bc in _pt_root () from /lib64/libnspr4.so
#7  0x00007f331d4e773a in start_thread () from /lib64/libpthread.so.0
#8  0x00007f331d221e0f in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7f3304ff9700 (LWP 63526)):
#0  0x00007f331d4ed460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f331db44fa3 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f3311983791 in replica_get_generation () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3  0x00007f3311958796 in _cl5GetDBFile () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#4  0x00007f331195b64c in cl5GetOperationCount () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#5  0x00007f3311987e29 in replica_config_search () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#6  0x00007f331f942ca9 in dse_call_callback.isra () from /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007f331f9449a0 in dse_search () from /usr/lib64/dirsrv/libslapd.so.0
#8  0x00007f331f97f560 in op_shared_search () from /usr/lib64/dirsrv/libslapd.so.0
#9  0x00005600a1683637 in do_search ()
#10 0x00005600a167151f in connection_threadmain ()
#11 0x00007f331db4a5bc in _pt_root () from /lib64/libnspr4.so
#12 0x00007f331d4e773a in start_thread () from /lib64/libpthread.so.0
#13 0x00007f331d221e0f in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f33067fc700 (LWP 63523)):
#0  0x00007f331d4ed460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f331db44fa3 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f3311983791 in replica_get_generation () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3  0x00007f331197cfd0 in copy_operation_parameters () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#4  0x00007f331197e660 in multimaster_preop_modify () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#5  0x00007f331f98aee9 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f331f98b154 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007f331f978641 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#8  0x00007f331f979b7c in do_modify () from /usr/lib64/dirsrv/libslapd.so.0
#9  0x00005600a167137d in connection_threadmain ()
#10 0x00007f331db4a5bc in _pt_root () from /lib64/libnspr4.so
#11 0x00007f331d4e773a in start_thread () from /lib64/libpthread.so.0
#12 0x00007f331d221e0f in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f33077fe700 (LWP 63521)):
#0  0x00007f331d4ed460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f331db44fa3 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f3311986258 in replica_update_state () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3  0x00007f331f950bfc in eq_loop () from /usr/lib64/dirsrv/libslapd.so.0
#4  0x00007f331db4a5bc in _pt_root () from /lib64/libnspr4.so
#5  0x00007f331d4e773a in start_thread () from /lib64/libpthread.so.0
#6  0x00007f331d221e0f in clone () from /lib64/libc.so.6

Indeed the fix https://pagure.io/389-ds-base/issue/49209 is not in 389-ds-base-1.3.5.17.tar.bz2

The bug https://pagure.io/389-ds-base/issue/49209 was introduced as a side effect of https://pagure.io/389-ds-base/issue/49008.
https://pagure.io/389-ds-base/issue/49008 was fixed in 1.3.5.16-1, that explains why this hang has not occur in 1.3.5.14

no, bug 40209 is not a side effect of #49008. It is a bug in handling the min_csn pending list. that was not changed in 49008

I agree

Actually the bug was introduced in https://fedorahosted.org/389/ticket/48048
(where we may leak the lock if csnplRemove fails).
https://fedorahosted.org/389/ticket/48048 was fixed in 1.3.3.9

-       if (NULL != r->min_csn_pl)
-       {
-               rc = csnplRemove(r->min_csn_pl, csn);
-               PR_ASSERT(rc == 0);
-       }
+    if (NULL != r->min_csn_pl)
+    {
+        int rc = csnplRemove(r->min_csn_pl, csn);
+        if (rc) {
+            slapi_log_error(SLAPI_LOG_FATAL, repl_plugin_name, "csnplRemove failed");
+            return;
+        }

I have not a clear understanding why it did not happen in 1.3.5-14

the hang only occurs if the first operation on a backend (min_csn) fails. but I also don't see what has changed and why it would not hang in 1.3.5

This should now be fixed upstream in 389-ds-base-1.3.5.18-1

@edewata would you please check with the new version that contains #49209.
Closing the ticket, just reopen it if the problem persists in 389-ds-base-1.3.5.18-1

Metadata Update from @tbordaz:
- Issue close_status updated to: duplicate
- Issue status updated to: Closed (was: Open)

6 years ago

Ticket #49209 says the problem happens on IPA, but we weren't able to reproduce it with IPA. Regardless, the new build does fix PKI CA cloning issue. Thanks!

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2378

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: duplicate)

3 years ago

Log in to comment on this ticket.

Metadata