Issue #50967: ns-slapd is randomly killed on replication - 389-ds-base

389-ds-base

#50967 ns-slapd is randomly killed on replication

Closed: wontfix 3 years ago by slev. Opened 4 years ago by slev.

I'm running FreeIPA CI on ALTLinux.
There is a long-standing issue.

389-ds: 1.4.1.13

There are random fails (time to time) on replicas with the similar error message:

Mar 20 05:43:24 replica1.ipa.test ns-slapd[5028]: [20/Mar/2020:05:43:24.951956519 +0000] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=5e7457f6000600030000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
Mar 20 05:43:25 replica1.ipa.test ns-slapd[5028]: [20/Mar/2020:05:43:25.404967477 +0000] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - Failed to write entry with csn (5e7457f6000600030000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
Mar 20 05:43:25 replica1.ipa.test ns-slapd[5028]: [20/Mar/2020:05:43:25.407032045 +0000] - ERR - NSMMReplicationPlugin - write_changelog_and_ruv - Can't add a change for uid=testuser2,cn=users,cn=accounts,dc=ipa,dc=test (uniqid: a88cca02-6a6d11ea-ab2cc543-a3b47dd2, optype: 16) to changelog csn 5e7457f6000600030000
Mar 20 05:43:25 replica1.ipa.test ns-slapd[5028]: [20/Mar/2020:05:43:25.413471931 +0000] - ERR - agmt="cn=meTomaster1.ipa.test" (master1:389) - clcache_load_buffer - Can't locate CSN 5e7457f5000200030000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
Mar 20 05:43:25 replica1.ipa.test ns-slapd[5028]: Assertion failure: mon->entryCount > 0, at ptsynch.c:648
Mar 20 05:43:25 replica1.ipa.test systemd[1]: dirsrv@IPA-TEST.service: Main process exited, code=killed, status=6/ABRT
Mar 20 05:43:25 replica1.ipa.test systemd[1]: dirsrv@IPA-TEST.service: Failed with result 'signal'.

In contrast to Fedora, ALTLinux's nspr is buit with --enable-debug, which enables all the assertions. The aforementioned function in ptsynch.c:648 is PR_ExitMonitor:

 637 PR_IMPLEMENT(PRStatus) PR_ExitMonitor(PRMonitor *mon)                           
 638 {                                                                               
 639     pthread_t self = pthread_self();                                            
 640     PRIntn rv;                                                                  
 641     PRBool notifyEntryWaiter = PR_FALSE;                                        
 642     PRIntn notifyTimes = 0;                                                     
 643                                                                                 
 644     PR_ASSERT(mon != NULL);                                                     
 645     rv = pthread_mutex_lock(&mon->lock);                                        
 646     PR_ASSERT(0 == rv);                                                         
 647     /* the entries should be > 0 and we'd better be the owner */                
 648     PR_ASSERT(mon->entryCount > 0);                                             
 649     PR_ASSERT(pthread_equal(mon->owner, self));                                 
 650     if (mon->entryCount == 0 || !pthread_equal(mon->owner, self))               
 651     {                                                                           
 652         rv = pthread_mutex_unlock(&mon->lock);                                  
 653         PR_ASSERT(0 == rv);                                                     
 654         return PR_FAILURE;                                                      
 655     }                                                                           
 656     ...

As I see from 389-ds code there is no return check for all of the invocations of PR_ExitMonitor (returns PR_SUCCESS or PR_FAILURE). What is the reason for that?

Where to dig?
Any help would be appreciated.

firstyear commented 4 years ago

I think you have some other problems first, your db is deadlocking. :(

What do you think @mreynolds ? It looks like we have a condition where we can try to exit a monitor we don't hold as we are deadlock, and aren't accounting for that ....

I think the reason is that PR_ASSERT is compiled out in non-debug, so wrapping all the monitors in asserts would cause us to have no locks at all. We could use a different assert type, or try to handle these errors better, but C is not a language conducive to clean error handling. And then we have to choose how to actually proceed in this situation, it seems like your server is not in a really healthy state right now ...

Metadata Update from @firstyear:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

4 years ago

slev commented 4 years ago

@firstyear, this isn't a production server, but CI. I run FreeIPA integration tests the same as upstream's ones. The difference here is OS. At least, on Fedora 31 I don't see this problem.
I could provide more details (e.g. logs), just tell me, please.

firstyear commented 4 years ago

Can you please tell me the version of your libdb package you have on that distribution?

slev commented 4 years ago

Sure,

# rpm -q --queryformat '%{VERSION}\n' libdb4.7
4.7.25

tbordaz commented 4 years ago

@slev, the abort looks possibly related to a replication dynamic corner case. An update is failing because DS fails to log it into the changelog and at the same time a replication agreement is fetching the changelog to find the next update to send.
If ABORT is related to replication would you please retry running with replication logging on 'nsslapd-errorlog-level: 8192'.
Also did the ABORT dumped a core or can you configure the box to allow DS to dump a core ? The difficulty being to know which monitor is not balanced and trigger an sigabrt.

tbordaz commented 4 years ago

@slev also, how frequently does it happen ? Could you test on 1.4.3 to check if is still happens

firstyear commented 4 years ago

I think there are possibly two issues here. As @tbordaz there is probably a replication corner case when the BE deadlocks causing this.

The second is that your version of libdb probably doesn't have a series of patches applied. Honestly, I myself need to get the suse packages updated and patched with some fixes that Red Hat have and have released. So maybe it's worth getting in touch with the maintainer of libdb in your distro and then we can get some of these fixes applied.

slev commented 4 years ago

@tbordaz, sorry for delay.
I built 1.4.1.16 with default 'nsslapd-errorlog-level: 8192'.

core:

logs:

About randomness:
Every test which requires replication fails at ~(20-30)%.
But this one (attached) fails always on my local machine.

slev commented 4 years ago

@firstyear, what is the preferred 389-ds-1.4.3 or Fedora's libdb?
My distro has libdb4.7 and libdb6.1, but I can build the same as Fedora has (5.3.28 + patches).

tbordaz commented 4 years ago

@slev, is it possible to dump the crashing thread from the core ? Thanks

firstyear commented 4 years ago

I think the fedora/centos libdb 5 version with patches is what you want. You can't use libdb6 with 389-ds.

slev commented 4 years ago

@tbordaz,
all threads are dumped.

Edited 4 years ago by slev

tbordaz commented 4 years ago

@slev, the problem looks to be UNBALANCED Monitor lock/unlock in entry cache :( . (nice catch !)

Thread 1 (Thread 0x7f4fefdff700 (LWP 5244)):
#0  0x00007f5017029891 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f501701355b in __GI_abort () at abort.c:79
#2  0x00007f50172bbbad in ??? () at /usr/lib64/libnspr4.so
#3  0x00007f4ff18ce8e0 in ??? ()
#4  0x00007f4fefdfc240 in ??? ()
#5  0x00007f4ff18351c0 in ??? ()
#6  0x00007f4fefdff700 in ??? ()
#7  0x00007f4ff18351c8 in ??? ()
#8  0x00007f50172d1610 in PR_CancelJob (jobp=0x288) at prtpool.c:1077
#9  0x00007f501613e3c8 in ??? ()
#10 0x00007f4fefdfc280 in ??? ()
#11 0x00007f50133aae9d in modify_term (mc=0x7f501613e3c8, mc@entry=0x7f4fefdfc240, be=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:77
#12 0x00007f501338f978 in ldbm_back_add (pb=0x7f4ff18017c0) at ldap/servers/slapd/back-ldbm/ldbm_add.c:1414
#13 0x00007f501754ad65 in op_shared_add (pb=pb@entry=0x7f4ff18017c0) at ldap/servers/slapd/add.c:689
#14 0x00007f501754b1df in add_internal_pb (pb=pb@entry=0x7f4ff18017c0) at ldap/servers/slapd/add.c:407
#15 0x00007f501754bf77 in slapi_add_internal_pb (pb=pb@entry=0x7f4ff18017c0) at ldap/servers/slapd/add.c:331
#16 0x00007f501332ed92 in mep_add_managed_entry (config=0x7f5012b492f0, origin=0x7f4ff1834740) at ldap/servers/plugins/mep/mep.c:1422
#17 0x00007f50133315ed in mep_add_post_op (pb=0x7f4ff1801580) at ldap/servers/plugins/mep/mep.c:2509
#18 0x00007f50175ac759 in plugin_call_func (list=0x7f501365a900, operation=operation@entry=560, pb=pb@entry=0x7f4ff1801580, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2031
#19 0x00007f50175ac993 in plugin_call_list (pb=0x7f4ff1801580, operation=560, list=<optimized out>) at ldap/servers/slapd/plugin.c:1972
#20 0x00007f50175ac993 in plugin_call_plugins (pb=pb@entry=0x7f4ff1801580, whichfunction=whichfunction@entry=560) at ldap/servers/slapd/plugin.c:442
#21 0x00007f5013390b3a in ldbm_back_add (pb=0x7f4ff1801580) at ldap/servers/slapd/back-ldbm/ldbm_add.c:1208
#22 0x00007f501754ad65 in op_shared_add (pb=pb@entry=0x7f4ff1801580) at ldap/servers/slapd/add.c:689
#23 0x00007f501754bdc2 in do_add (pb=pb@entry=0x7f4ff1801580) at ldap/servers/slapd/add.c:236
#24 0x00000000004199a4 in connection_dispatch_operation (pb=0x7f4ff1801580, op=0x7f5012981800, conn=0x7f500fc5c050) at ldap/servers/slapd/connection.c:623
#25 0x00000000004199a4 in connection_threadmain () at ldap/servers/slapd/connection.c:1785
#26 0x00007f50172d7cce in _pr_poll_with_poll (pds=0x7ffe8380e17e, npds=5244, timeout=295831040) at ptio.c:4111
#27 0x0000000000000000 in ??? ()

More specifically when adding user "uid=testuser2,cn=users,cn=accounts,dc=ipa,dc=test", MEP plugin adds "cn=testuser2,cn=groups,cn=accounts,dc=ipa,dc=test". This MEP add updates the parent entry ("cn=groups,cn=accounts,dc=ipa,dc=test") for numsubordinates.
The MEP ADD failed (ldap_result_code=LDAP_OPERATIONS_ERROR) that is likely a consequence/side effect of DB_DEADLOCK.
During MEP ADD cleanup it release the Monitor lock on the parent entry that is the entry cache but at this time the the Monitor lock have not been acquired !

I think a condition to reproduce is the nice idea to build nspr with --enable_debug. It catches the unbalance Monitor during a DB_deadlock.
Next is how to reproduce the db_deadlock, I think it is not easy. One possibility I can think of is having a large number of replication agreement that can be in competition with a write load. Did you identify a reproducible test case ?

tbordaz commented 4 years ago

Just for recording, 'parententry' was locked in entry cache (find_entry2modify_only) so I suspect that at some point 'parententry' can be replaced from modify context 'parent_modify_c' with an unlocked entry (possibly modify_switch_entries or modify_unswitch_entries)

Metadata Update from @mreynolds:
- Issue priority set to: major
- Issue set to the milestone: 1.4.1

4 years ago

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.3 (was: 1.4.1)

3 years ago

slev commented 3 years ago

With libdb5.3 the aforementioned deadlock disappears at all, while libdb6.1 introduces another issue. So the former is our choice. Thank you all for the kind support!

Closing.

Metadata Update from @slev:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

3 years ago

firstyear commented 3 years ago

@slev we can't use libdb 6 due to AGPL taint.

spichugi commented 3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/4020

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago

Metadata

Assignee

None

Tags

None

Blocking

None

Depending on

None

Priority

major

Milestone

1.4.3

reviewstatus

None

rhbz

None

origin

None