#50967 ns-slapd is randomly killed on replication
Closed: wontfix 3 years ago by slev. Opened 4 years ago by slev.

I'm running FreeIPA CI on ALTLinux.
There is a long-standing issue.

389-ds: 1.4.1.13

There are random fails (time to time) on replicas with the similar error message:

Mar 20 05:43:24 replica1.ipa.test ns-slapd[5028]: [20/Mar/2020:05:43:24.951956519 +0000] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=5e7457f6000600030000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
Mar 20 05:43:25 replica1.ipa.test ns-slapd[5028]: [20/Mar/2020:05:43:25.404967477 +0000] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - Failed to write entry with csn (5e7457f6000600030000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
Mar 20 05:43:25 replica1.ipa.test ns-slapd[5028]: [20/Mar/2020:05:43:25.407032045 +0000] - ERR - NSMMReplicationPlugin - write_changelog_and_ruv - Can't add a change for uid=testuser2,cn=users,cn=accounts,dc=ipa,dc=test (uniqid: a88cca02-6a6d11ea-ab2cc543-a3b47dd2, optype: 16) to changelog csn 5e7457f6000600030000
Mar 20 05:43:25 replica1.ipa.test ns-slapd[5028]: [20/Mar/2020:05:43:25.413471931 +0000] - ERR - agmt="cn=meTomaster1.ipa.test" (master1:389) - clcache_load_buffer - Can't locate CSN 5e7457f5000200030000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
Mar 20 05:43:25 replica1.ipa.test ns-slapd[5028]: Assertion failure: mon->entryCount > 0, at ptsynch.c:648
Mar 20 05:43:25 replica1.ipa.test systemd[1]: dirsrv@IPA-TEST.service: Main process exited, code=killed, status=6/ABRT
Mar 20 05:43:25 replica1.ipa.test systemd[1]: dirsrv@IPA-TEST.service: Failed with result 'signal'.

In contrast to Fedora, ALTLinux's nspr is buit with --enable-debug, which enables all the assertions. The aforementioned function in ptsynch.c:648 is PR_ExitMonitor:

 637 PR_IMPLEMENT(PRStatus) PR_ExitMonitor(PRMonitor *mon)                           
 638 {                                                                               
 639     pthread_t self = pthread_self();                                            
 640     PRIntn rv;                                                                  
 641     PRBool notifyEntryWaiter = PR_FALSE;                                        
 642     PRIntn notifyTimes = 0;                                                     
 643                                                                                 
 644     PR_ASSERT(mon != NULL);                                                     
 645     rv = pthread_mutex_lock(&mon->lock);                                        
 646     PR_ASSERT(0 == rv);                                                         
 647     /* the entries should be > 0 and we'd better be the owner */                
 648     PR_ASSERT(mon->entryCount > 0);                                             
 649     PR_ASSERT(pthread_equal(mon->owner, self));                                 
 650     if (mon->entryCount == 0 || !pthread_equal(mon->owner, self))               
 651     {                                                                           
 652         rv = pthread_mutex_unlock(&mon->lock);                                  
 653         PR_ASSERT(0 == rv);                                                     
 654         return PR_FAILURE;                                                      
 655     }                                                                           
 656     ...

As I see from 389-ds code there is no return check for all of the invocations of PR_ExitMonitor (returns PR_SUCCESS or PR_FAILURE). What is the reason for that?

Where to dig?
Any help would be appreciated.


I think you have some other problems first, your db is deadlocking. :(

What do you think @mreynolds ? It looks like we have a condition where we can try to exit a monitor we don't hold as we are deadlock, and aren't accounting for that ....

I think the reason is that PR_ASSERT is compiled out in non-debug, so wrapping all the monitors in asserts would cause us to have no locks at all. We could use a different assert type, or try to handle these errors better, but C is not a language conducive to clean error handling. And then we have to choose how to actually proceed in this situation, it seems like your server is not in a really healthy state right now ...

Metadata Update from @firstyear:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

4 years ago

@firstyear, this isn't a production server, but CI. I run FreeIPA integration tests the same as upstream's ones. The difference here is OS. At least, on Fedora 31 I don't see this problem.
I could provide more details (e.g. logs), just tell me, please.

Can you please tell me the version of your libdb package you have on that distribution?

Sure,

# rpm -q --queryformat '%{VERSION}\n' libdb4.7
4.7.25

@slev, the abort looks possibly related to a replication dynamic corner case. An update is failing because DS fails to log it into the changelog and at the same time a replication agreement is fetching the changelog to find the next update to send.
If ABORT is related to replication would you please retry running with replication logging on 'nsslapd-errorlog-level: 8192'.
Also did the ABORT dumped a core or can you configure the box to allow DS to dump a core ? The difficulty being to know which monitor is not balanced and trigger an sigabrt.

@slev also, how frequently does it happen ? Could you test on 1.4.3 to check if is still happens

I think there are possibly two issues here. As @tbordaz there is probably a replication corner case when the BE deadlocks causing this.

The second is that your version of libdb probably doesn't have a series of patches applied. Honestly, I myself need to get the suse packages updated and patched with some fixes that Red Hat have and have released. So maybe it's worth getting in touch with the maintainer of libdb in your distro and then we can get some of these fixes applied.

@firstyear, what is the preferred 389-ds-1.4.3 or Fedora's libdb?
My distro has libdb4.7 and libdb6.1, but I can build the same as Fedora has (5.3.28 + patches).

@slev, is it possible to dump the crashing thread from the core ? Thanks

I think the fedora/centos libdb 5 version with patches is what you want. You can't use libdb6 with 389-ds.

@slev, the problem looks to be UNBALANCED Monitor lock/unlock in entry cache :( . (nice catch !)

Thread 1 (Thread 0x7f4fefdff700 (LWP 5244)):
#0  0x00007f5017029891 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f501701355b in __GI_abort () at abort.c:79
#2  0x00007f50172bbbad in ??? () at /usr/lib64/libnspr4.so
#3  0x00007f4ff18ce8e0 in ??? ()
#4  0x00007f4fefdfc240 in ??? ()
#5  0x00007f4ff18351c0 in ??? ()
#6  0x00007f4fefdff700 in ??? ()
#7  0x00007f4ff18351c8 in ??? ()
#8  0x00007f50172d1610 in PR_CancelJob (jobp=0x288) at prtpool.c:1077
#9  0x00007f501613e3c8 in ??? ()
#10 0x00007f4fefdfc280 in ??? ()
#11 0x00007f50133aae9d in modify_term (mc=0x7f501613e3c8, mc@entry=0x7f4fefdfc240, be=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:77
#12 0x00007f501338f978 in ldbm_back_add (pb=0x7f4ff18017c0) at ldap/servers/slapd/back-ldbm/ldbm_add.c:1414
#13 0x00007f501754ad65 in op_shared_add (pb=pb@entry=0x7f4ff18017c0) at ldap/servers/slapd/add.c:689
#14 0x00007f501754b1df in add_internal_pb (pb=pb@entry=0x7f4ff18017c0) at ldap/servers/slapd/add.c:407
#15 0x00007f501754bf77 in slapi_add_internal_pb (pb=pb@entry=0x7f4ff18017c0) at ldap/servers/slapd/add.c:331
#16 0x00007f501332ed92 in mep_add_managed_entry (config=0x7f5012b492f0, origin=0x7f4ff1834740) at ldap/servers/plugins/mep/mep.c:1422
#17 0x00007f50133315ed in mep_add_post_op (pb=0x7f4ff1801580) at ldap/servers/plugins/mep/mep.c:2509
#18 0x00007f50175ac759 in plugin_call_func (list=0x7f501365a900, operation=operation@entry=560, pb=pb@entry=0x7f4ff1801580, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2031
#19 0x00007f50175ac993 in plugin_call_list (pb=0x7f4ff1801580, operation=560, list=<optimized out>) at ldap/servers/slapd/plugin.c:1972
#20 0x00007f50175ac993 in plugin_call_plugins (pb=pb@entry=0x7f4ff1801580, whichfunction=whichfunction@entry=560) at ldap/servers/slapd/plugin.c:442
#21 0x00007f5013390b3a in ldbm_back_add (pb=0x7f4ff1801580) at ldap/servers/slapd/back-ldbm/ldbm_add.c:1208
#22 0x00007f501754ad65 in op_shared_add (pb=pb@entry=0x7f4ff1801580) at ldap/servers/slapd/add.c:689
#23 0x00007f501754bdc2 in do_add (pb=pb@entry=0x7f4ff1801580) at ldap/servers/slapd/add.c:236
#24 0x00000000004199a4 in connection_dispatch_operation (pb=0x7f4ff1801580, op=0x7f5012981800, conn=0x7f500fc5c050) at ldap/servers/slapd/connection.c:623
#25 0x00000000004199a4 in connection_threadmain () at ldap/servers/slapd/connection.c:1785
#26 0x00007f50172d7cce in _pr_poll_with_poll (pds=0x7ffe8380e17e, npds=5244, timeout=295831040) at ptio.c:4111
#27 0x0000000000000000 in ??? ()

More specifically when adding user "uid=testuser2,cn=users,cn=accounts,dc=ipa,dc=test", MEP plugin adds "cn=testuser2,cn=groups,cn=accounts,dc=ipa,dc=test". This MEP add updates the parent entry ("cn=groups,cn=accounts,dc=ipa,dc=test") for numsubordinates.
The MEP ADD failed (ldap_result_code=LDAP_OPERATIONS_ERROR) that is likely a consequence/side effect of DB_DEADLOCK.
During MEP ADD cleanup it release the Monitor lock on the parent entry that is the entry cache but at this time the the Monitor lock have not been acquired !

I think a condition to reproduce is the nice idea to build nspr with --enable_debug. It catches the unbalance Monitor during a DB_deadlock.
Next is how to reproduce the db_deadlock, I think it is not easy. One possibility I can think of is having a large number of replication agreement that can be in competition with a write load. Did you identify a reproducible test case ?

Just for recording, 'parententry' was locked in entry cache (find_entry2modify_only) so I suspect that at some point 'parententry' can be replaced from modify context 'parent_modify_c' with an unlocked entry (possibly modify_switch_entries or modify_unswitch_entries)

Metadata Update from @mreynolds:
- Issue priority set to: major
- Issue set to the milestone: 1.4.1

4 years ago

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.3 (was: 1.4.1)

3 years ago

With libdb5.3 the aforementioned deadlock disappears at all, while libdb6.1 introduces another issue. So the former is our choice. Thank you all for the kind support!

Closing.

Metadata Update from @slev:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

3 years ago

@slev we can't use libdb 6 due to AGPL taint.

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/4020

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago

Login to comment on this ticket.

Metadata