#618 Crash at shutdown while stopping replica agreements
Closed: Fixed None Opened 6 years ago by tbordaz.

The crash occurs while running the acceptance tests DNA.
The crash occurs while testing a private version (master+own modifications). My modifications are out of replication area and the possibility it triggers the crash are very very low.

The crash is a dynamic issue that occurs because the slapd deamon has shutdown the replication plugin and this one has shut down the replica agreement. But replica agreement that has not yet detected the shutdown was still using its structure. (so this crash is not systematic)

The Stack is looking like:

...
Thread 3 (Thread 0x7f8e16a4a800 (LWP 6141)):
#0  0x00000035d6cea9f3 in select () from /usr/lib64/libc.so.6
#1  0x00007f8e16b33947 in DS_Sleep (ticks=100) at ldap/servers/slapd/util.c:1108
#2  0x00007f8e11c21a57 in _cl5Close () at ldap/servers/plugins/replication/cl5_api.c:3207
#3  0x00007f8e11c1d4d5 in cl5Close () at ldap/servers/plugins/replication/cl5_api.c:585
#4  0x00007f8e11c2bb92 in changelog5_cleanup () at ldap/servers/plugins/replication/cl5_init.c:111
#5  0x00007f8e11c436f6 in multimaster_stop (pb=0x7fff88c951b0) at ldap/servers/plugins/replication/repl5_init.c:755
#6  0x00007f8e16aff6d1 in plugin_call_func (list=0x203ebc0, operation=210, pb=0x7fff88c951b0, call_one=1)
    at ldap/servers/slapd/plugin.c:1453
#7  0x00007f8e16aff5be in plugin_call_one (list=0x203ebc0, operation=210, pb=0x7fff88c951b0) at ldap/servers/slapd/plugin.c:1421
#8  0x00007f8e16aff4d1 in plugin_dependency_closeall () at ldap/servers/slapd/plugin.c:1365
#9  0x00007f8e16aff564 in plugin_closeall (close_backends=1, close_globals=1) at ldap/servers/slapd/plugin.c:1408
#10 0x000000000041988d in slapd_daemon (ports=0x7fff88c956f0) at ldap/servers/slapd/daemon.c:1358
#11 0x0000000000420dab in main (argc=7, argv=0x7fff88c95878) at ldap/servers/slapd/main.c:1266


Thread 1 (Thread 0x7f8e07fff700 (LWP 6150)):
#0  0x000000370ea116a7 in ?? () from /usr/lib64/libnspr4.so
#1  0x000000370ea115fb in ?? () from /usr/lib64/libnspr4.so
#2  0x000000370ea11f8d in ?? () from /usr/lib64/libnspr4.so
#3  0x000000370ea12972 in PR_vsnprintf () from /usr/lib64/libnspr4.so
#4  0x000000370ea12a42 in PR_snprintf () from /usr/lib64/libnspr4.so
#5  0x00007f8e11c37f71 in agmt_set_last_update_status (ra=0x5a333333323830, ldaprc=0, replrc=0, 
    message=0x7f8e11c84a4c "Incremental update succeeded") at ldap/servers/plugins/replication/repl5_agmt.c:2181
#6  0x00007f8e11c3fc26 in repl5_inc_run (prp=0x22ebf00) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1041
#7  0x00007f8e11c468d8 in prot_thread_main (arg=0x22cf120) at ldap/servers/plugins/replication/repl5_protocol.c:295
#8  0x000000370ea28cf3 in ?? () from /usr/lib64/libnspr4.so
#9  0x00000035d7007d14 in start_thread () from /usr/lib64/libpthread.so.0
#10 0x00000035d6cf168d in clone () from /usr/lib64/libc.so.6
...
(gdb) thread
[Current thread is 1 (Thread 0x7f8e07fff700 (LWP 6150))]
(gdb) frame
#6  0x00007f8e11c3fc26 in repl5_inc_run (prp=0x22ebf00) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1041
1041                              agmt_set_last_update_status(prp->agmt, 0, 0, "Incremental update succeeded");
(gdb) print prp->agmt
$5 = (Repl_Agmt *) 0x5a333333323830   <=== invalid pointer
(gdb) print *prp
$6 = {delete = 0x22bf7d0, run = 0, stop = 0, status = 0x31, notify_update = 0x22b9740, notify_agmt_changed = 0, 
  notify_window_opened = 0, notify_window_closed = 0, update_now = 0x30, lock = 0x21, cvar = 0x6954796669646f6d, 
  stopped = 1953719661, terminate = 7368033, eventbits = 36, conn = 0x21, last_acquire_response_code = 858861618, 
  agmt = 0x5a333333323830, replica_object = 0x0, private = 0x21, replica_acquired = 36431920, repl50consumer = 0, 
  repl71consumer = 1835093619, repl90consumer = 101, timeout = 0}

The solution is looking like the replica agreement protocol shutdown function should wait for the RA thread to complete.

Note: it is note clear to me if I have to attach the core+RPM to that ticket


'''Here is the current status'''

  • This bug applies to all releases (1.2.10, 1.2.11, 1.3.0, master, rhel-6.3 and rhel-6.4)[[BR]]
    the issue is that stopping the RA protocol DS waits for a grace delay to see the protocol being stopped. In case RA is long to detect the shutdown, the grace delay expires and the data structure will be freed.

  • The impact of the bug is a crash at shutdown. That will trigger a recovery at startup. That means startup will be delayed by the time of the recovery. An additional risk is the need to reinit the server if recovery fails.

  • The bug occurs on rare dynamic situation

'''Here are the next step'''

  • Investigate a fix. [[BR]]
    Note: I suggest to join the RA thread after having asked the protocol stop.

git merge ticket618
Updating 4b7791f..957bea3
Fast-forward
ldap/servers/plugins/replication/repl5_protocol.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.01 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
4b7791f..957bea3 master -> master

commit 957bea3
Author: Thierry bordaz (tbordaz) tbordaz@redhat.com
Date: Thu Mar 21 11:11:34 2013 +0100

Bug description: coverity reported "13159 - Dereference after null check"

Fix description: call PR_JoinThread inside of (NULL != rp) clause.

Pushed to master: commit 88ef8cb (Ticket #618)

Here is the current status

Metadata Update from @nkinder:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.3.1

2 years ago

Login to comment on this ticket.

Metadata