There are race conditions in some of the cleanallruv code where we can go to sleep without checking if the server is shutting down. Like when checking if replicas are online:
Thread 2 (Thread 0x7f2a4efe5700 (LWP 29721)): #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218 #1 0x00007f2a86c73217 in pt_TimedWait (cv=cv@entry=0x7f2a8ab43e78, ml=0x7f2a8ab72e20, timeout=timeout@entry=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:260 #2 0x00007f2a86c736de in PR_WaitCondVar (cvar=0x7f2a8ab43e70, timeout=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:387 #3 0x00007f2a7e8282dc in check_agmts_are_alive (replica=0x7f2a8ab8ee40, rid=300, task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:2275 #4 0x00007f2a7e827015 in replica_cleanallruv_thread (arg=0x7f29fc010bc0) at ldap/servers/plugins/replication/repl5_replica_config.c:1816 #5 0x00007f2a86c78b46 in _pt_root (arg=0x7f29fc014950) at ../../../nspr/pr/src/pthreads/ptthread.c:204 #6 0x00007f2a8661bd14 in start_thread (arg=0x7f2a4efe5700) at pthread_create.c:309 #7 0x00007f2a8613968d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 1 (Thread 0x7f2a88d35800 (LWP 29195)): #0 0x00007f2a861329f3 in select () at ../sysdeps/unix/syscall-template.S:82 #1 0x00007f2a888df81f in DS_Sleep (ticks=100) at ldap/servers/slapd/util.c:1035 #2 0x00007f2a7e8277d0 in replica_cleanall_ruv_destructor (task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:1993 #3 0x00007f2a888d4396 in destroy_task (when=0, arg=0x7f29fc0111f0) at ldap/servers/slapd/task.c:621 #4 0x00007f2a888d96c5 in task_shutdown () at ldap/servers/slapd/task.c:2539 #5 0x00007f2a88d86537 in slapd_daemon (ports=0x7fffac8b2e90) at ldap/servers/slapd/daemon.c:1387 #6 0x00007f2a88d8f05d in main (argc=7, argv=0x7fffac8b2fc8) at ldap/servers/slapd/main.c:1115
I don't think that the missing shutdown check before sleeping is a big deal, it will be checked at every iteration in the while() conditions, so there is only a small window you miss, but after sleep the loop will be termimnated. Although an extra check befoe sleeping doesn't hurt.
The problem was the missing stop_ruv_cleaning() calls, that's ok now.
But can the stop_ruv_cleaning() in multimaster_stop be removed ? we could stop the plugin without shutdown (in theory).
Replying to [comment:2 lkrispen]:
I was just being overly cautious :-)
Correct.
I'm not sure we need the plugin running for cleanallruv to finish, but I can add it back(it doesn't hurt). New patch in the works...
revision 0001-Ticket-48217-cleanAllRUV-hangs-shutdown-if-not-all-o.patch
New patch attached...
The problem is if slapd is shutting down while it is waiting on a condvar. There needs to be a way that something can detect shutdown and do a notifycondvar to wake up those waits immediately upon shutdown.
Replying to [comment:6 rmeggins]:
yes, but Mark's fix does it now. Shutdown was hanging in replica_cleanall_ruv_destructor() and this now calls stop_ruv_cleaning()
Replying to [comment:7 lkrispen]:
Replying to [comment:6 rmeggins]: The problem is if slapd is shutting down while it is waiting on a condvar. There needs to be a way that something can detect shutdown and do a notifycondvar to wake up those waits immediately upon shutdown. yes, but Mark's fix does it now. Shutdown was hanging in replica_cleanall_ruv_destructor() and this now calls stop_ruv_cleaning()
ok
fdf4681..d6269f2 master -> master commit d6269f2 Author: Mark Reynolds mreynolds@redhat.com Date: Thu Jul 9 09:59:46 2015 -0400
41dff5b..0bb881a 389-ds-base-1.3.4 -> 389-ds-base-1.3.4 commit 0bb881a
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1241723
This fix introduced a regression. When the server was stopped during a clean task, the task would think everyone was cleaned(when in fact they were not). Need to properly detect the shutdown at the end of the task.
Fix regression with server shutdown 0001-Ticket-48217-cleanallruv-fix-regression-with-server-.patch
a8130ab..c41d36d master -> master commit c41d36d Author: Mark Reynolds mreynolds@redhat.com Date: Fri Sep 18 11:56:29 2015 -0400
8cd4f45..d9f03f5 389-ds-base-1.3.4 -> 389-ds-base-1.3.4 commit d9f03f5
Metadata Update from @mreynolds: - Issue assigned to mreynolds - Issue set to the milestone: 1.3.4.3
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1548
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)
Login to comment on this ticket.