#48217 cleanallruv hangs shutdown if not all replicas online
Closed: Fixed None Opened 4 years ago by mreynolds.

There are race conditions in some of the cleanallruv code where we can go to sleep without checking if the server is shutting down. Like when checking if replicas are online:

Thread 2 (Thread 0x7f2a4efe5700 (LWP 29721)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x00007f2a86c73217 in pt_TimedWait (cv=cv@entry=0x7f2a8ab43e78, ml=0x7f2a8ab72e20, timeout=timeout@entry=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:260
#2  0x00007f2a86c736de in PR_WaitCondVar (cvar=0x7f2a8ab43e70, timeout=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:387
#3  0x00007f2a7e8282dc in check_agmts_are_alive (replica=0x7f2a8ab8ee40, rid=300, task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:2275
#4  0x00007f2a7e827015 in replica_cleanallruv_thread (arg=0x7f29fc010bc0) at ldap/servers/plugins/replication/repl5_replica_config.c:1816
#5  0x00007f2a86c78b46 in _pt_root (arg=0x7f29fc014950) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#6  0x00007f2a8661bd14 in start_thread (arg=0x7f2a4efe5700) at pthread_create.c:309
#7  0x00007f2a8613968d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f2a88d35800 (LWP 29195)):
#0  0x00007f2a861329f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f2a888df81f in DS_Sleep (ticks=100) at ldap/servers/slapd/util.c:1035
#2  0x00007f2a7e8277d0 in replica_cleanall_ruv_destructor (task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:1993
#3  0x00007f2a888d4396 in destroy_task (when=0, arg=0x7f29fc0111f0) at ldap/servers/slapd/task.c:621
#4  0x00007f2a888d96c5 in task_shutdown () at ldap/servers/slapd/task.c:2539
#5  0x00007f2a88d86537 in slapd_daemon (ports=0x7fffac8b2e90) at ldap/servers/slapd/daemon.c:1387
#6  0x00007f2a88d8f05d in main (argc=7, argv=0x7fffac8b2fc8) at ldap/servers/slapd/main.c:1115

I don't think that the missing shutdown check before sleeping is a big deal, it will be checked at every iteration in the while() conditions, so there is only a small window you miss, but after sleep the loop will be termimnated. Although an extra check befoe sleeping doesn't hurt.

The problem was the missing stop_ruv_cleaning() calls, that's ok now.

But can the stop_ruv_cleaning() in multimaster_stop be removed ? we could stop the plugin without shutdown (in theory).

Replying to [comment:2 lkrispen]:

I don't think that the missing shutdown check before sleeping is a big deal, it will be checked at every iteration in the while() conditions, so there is only a small window you miss, but after sleep the loop will be termimnated. Although an extra check befoe sleeping doesn't hurt.

I was just being overly cautious :-)

The problem was the missing stop_ruv_cleaning() calls, that's ok now.

Correct.

But can the stop_ruv_cleaning() in multimaster_stop be removed ? we could stop the plugin without shutdown (in theory).

I'm not sure we need the plugin running for cleanallruv to finish, but I can add it back(it doesn't hurt). New patch in the works...

The problem is if slapd is shutting down while it is waiting on a condvar. There needs to be a way that something can detect shutdown and do a notifycondvar to wake up those waits immediately upon shutdown.

Replying to [comment:6 rmeggins]:

The problem is if slapd is shutting down while it is waiting on a condvar. There needs to be a way that something can detect shutdown and do a notifycondvar to wake up those waits immediately upon shutdown.

yes, but Mark's fix does it now. Shutdown was hanging in replica_cleanall_ruv_destructor() and this now calls stop_ruv_cleaning()

Replying to [comment:7 lkrispen]:

Replying to [comment:6 rmeggins]:

The problem is if slapd is shutting down while it is waiting on a condvar. There needs to be a way that something can detect shutdown and do a notifycondvar to wake up those waits immediately upon shutdown.

yes, but Mark's fix does it now. Shutdown was hanging in replica_cleanall_ruv_destructor() and this now calls stop_ruv_cleaning()

ok

fdf4681..d6269f2 master -> master
commit d6269f2
Author: Mark Reynolds mreynolds@redhat.com
Date: Thu Jul 9 09:59:46 2015 -0400

41dff5b..0bb881a 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 0bb881a

This fix introduced a regression. When the server was stopped during a clean task, the task would think everyone was cleaned(when in fact they were not). Need to properly detect the shutdown at the end of the task.

a8130ab..c41d36d master -> master
commit c41d36d
Author: Mark Reynolds mreynolds@redhat.com
Date: Fri Sep 18 11:56:29 2015 -0400

8cd4f45..d9f03f5 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit d9f03f5

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.3.4.3

2 years ago

Login to comment on this ticket.

Metadata