#4586 Tests: host-del returns DatabaseError
Closed: Fixed None Opened 8 years ago by mkosek.

With 389-ds-base-1.3.3.2-2.fc20.x86_64 I get:

# ./make-test ipatests/test_xmlrpc/test_cert_plugin.py
...
======================================================================
ERROR: Clean up cert test data
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/root/freeipa-master/ipatests/test_xmlrpc/test_cert_plugin.py", line 198, in test_0007_cleanup
    api.Command['host_del'](self.host_fqdn)
  File "/root/freeipa-master/ipalib/frontend.py", line 439, in __call__
    ret = self.run(*args, **options)
  File "/root/freeipa-master/ipalib/frontend.py", line 755, in run
    return self.forward(*args, **options)
  File "/root/freeipa-master/ipalib/frontend.py", line 776, in forward
    return self.Backend.rpcclient.forward(self.name, *args, **kw)
  File "/root/freeipa-master/ipalib/rpc.py", line 880, in forward
    return self._call_command(command, params)
  File "/root/freeipa-master/ipalib/rpc.py", line 857, in _call_command
    return command(*params)
  File "/root/freeipa-master/ipalib/rpc.py", line 1004, in _call
    return self.__request(name, args)
  File "/root/freeipa-master/ipalib/rpc.py", line 998, in __request
    raise error_class(message=error['message'])
DatabaseError: Operations error:

errors log:

[28/Sep/2014:18:43:18 -0400] - database error -30993
[28/Sep/2014:18:43:18 -0400] - database error -30993
[28/Sep/2014:18:43:19 -0400] - database error -30993
[28/Sep/2014:18:43:19 -0400] - database error -30993
[28/Sep/2014:18:43:19 -0400] - database error -30993
[28/Sep/2014:18:43:20 -0400] - database error -30993
[28/Sep/2014:18:43:20 -0400] - database error -30993
[28/Sep/2014:18:43:20 -0400] - database error -30993
[28/Sep/2014:18:43:20 -0400] - database error -30993
[28/Sep/2014:18:43:20 -0400] - database error -30993
[28/Sep/2014:18:43:21 -0400] - database error -30993
[28/Sep/2014:18:43:21 -0400] - database error -30993
[28/Sep/2014:18:43:21 -0400] - database error -30993
[28/Sep/2014:18:43:22 -0400] - database error -30993
[28/Sep/2014:18:43:22 -0400] - database error -30993
[28/Sep/2014:18:43:22 -0400] - Operation error fetching Null DN (c970c1b8-476011e4-9f61c45a-c8138913),  error -30993.
[28/Sep/2014:18:43:22 -0400] - dn2entry_ext: Failed to get id for changenumber=7523,cn=changelog from   entryrdn index (-30993)
[28/Sep/2014:18:43:22 -0400] - Operation error fetching changenumber=7523,cn=changelog (null), error -  30993.
[28/Sep/2014:18:43:22 -0400] DSRetroclPlugin - replog: an error occured while adding change number      7523, dn = changenumber=7523,cn=changelog: Operations error. 
[28/Sep/2014:18:43:22 -0400] retrocl-plugin - retrocl_postob: operation failure [1]

access log:

[28/Sep/2014:18:43:18 -0400] conn=2153 op=14 DEL dn="krbprincipalname=test/ipatestcert.mkosek-fedora20. test@MKOSEK-FEDORA20.TEST,cn=services,cn=accounts,dc=mkosek-fedora20,dc=test"
[28/Sep/2014:18:43:18 -0400] conn=3 op=177 SRCH base="cn=15,ou=certificateRepository,ou=ca,o=ipaca"     scope=0 filter="(objectClass=*)" attrs=ALL
[28/Sep/2014:18:43:18 -0400] conn=3 op=177 RESULT err=0 tag=101 nentries=1 etime=0
[28/Sep/2014:18:43:18 -0400] conn=3 op=178 SRCH base="cn=19,ou=ca,ou=requests,o=ipaca" scope=0          filter="(objectClass=*)" attrs=ALL
[28/Sep/2014:18:43:18 -0400] conn=3 op=178 RESULT err=0 tag=101 nentries=1 etime=0
[28/Sep/2014:18:43:18 -0400] conn=3 op=179 MOD dn="ou=ca,ou=requests,o=ipaca"
[28/Sep/2014:18:43:18 -0400] conn=2153 op=15 SRCH base="fqdn=ipatestcert.mkosek-fedora20.test,          cn=computers,cn=accounts,dc=mkosek-fedora20,dc=test" scope=0 filter="(objectClass=*)"                   attrs="userCertificate"
[28/Sep/2014:18:43:18 -0400] conn=2153 op=15 RESULT err=0 tag=101 nentries=1 etime=0
[28/Sep/2014:18:43:18 -0400] conn=2153 op=16 DEL dn="fqdn=ipatestcert.mkosek-fedora20.test,cn=computers,cn=accounts,dc=mkosek-fedora20,dc=test"
[28/Sep/2014:18:43:18 -0400] conn=2153 op=14 RESULT err=0 tag=107 nentries=0 etime=0
[28/Sep/2014:18:43:22 -0400] conn=3 op=179 RESULT err=0 tag=103 nentries=0 etime=4
[28/Sep/2014:18:43:22 -0400] conn=2153 op=16 RESULT err=1 tag=107 nentries=0 etime=4
[28/Sep/2014:18:43:23 -0400] conn=2154 fd=136 slot=136 connection from 10.16.78.147 to 10.16.78.147
[28/Sep/2014:18:43:23 -0400] conn=2153 op=17 UNBIND

So ds version is: 1.3.3.2-2, what is the IPA version ? I assume the test script is executed from the source tree, which branch ?
To reproduce, is it sufficient to install an IPA master and the run this specific test ?

I run FreeIPA built from ipa-4-1 branch, but I think any 4.0+ version will be sufficient as this script did not change much lately.

The reproduction happened for me exactly as you described it.

I can also reproduce the problem on a VM (vm-046.idm.lab.bos.redhat.com).

[29/Sep/2014:10:38:57 -0400] - database error -30993
[29/Sep/2014:10:38:57 -0400] - database error -30993
[29/Sep/2014:10:38:57 -0400] - Operation error fetching Null DN (57db5683-47e611e4-b8f897c9-2682f804), error -30993.
[29/Sep/2014:10:38:57 -0400] - dn2entry_ext: Failed to get id for changenumber=4873,cn=changelog from entryrdn index (-30993)
[29/Sep/2014:10:38:57 -0400] - Operation error fetching changenumber=4873,cn=changelog (null), error -30993.
[29/Sep/2014:10:38:57 -0400] DSRetroclPlugin - replog: an error occured while adding change number 4873, dn = changenumber=4873,cn=changelog: Operations error. 
[29/Sep/2014:10:38:57 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
[29/Sep/2014:10:38:58 -0400] - database error -30993
[29/Sep/2014:10:38:58 -0400] dna-plugin - dna_get_shared_servers: search failed for shared config: cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=example,dc=com [error 1]

DB returns a DB_DEADLOCK during a ADD while updating entryrdn index.
This is looking like transient error and DS should retry.

I think it does retry, but the retry count will be exceeded.

Note that it is a failure in the retro cl, in the original post from Martin, we have two concurrent operations on two backends, which share the changelog backend. And both ops took 4seconds.

I will look into your installation to verify it's the same pattern

That is correct max retry count exceeded (enable db error log):

[29/Sep/2014:11:14:01 -0400] - index read retrying transaction WARNING 1045, err=-30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[29/Sep/2014:11:14:01 -0400] - idl_new.c WARNING 2, err=-30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[29/Sep/2014:11:14:01 -0400] - index read retrying transaction WARNING 1045, err=-30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[29/Sep/2014:11:14:01 -0400] - index_read retry count exceeded WARNING 1046, err=-30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[29/Sep/2014:11:14:01 -0400] - Operation error fetching Null DN (3b8de898-47eb11e4-b8f897c9-2682f804), error -30993.
[29/Sep/2014:11:14:01 -0400] entryrdn-index - _entryrdn_get_elem: cursor get deadlock while under txn -> failure
[29/Sep/2014:11:14:01 -0400] entryrdn-index - _entryrdn_index_read: Suffix "cn=changelog" cursor get fails: BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock(-30993)
[29/Sep/2014:11:14:01 -0400] - dn2entry_ext: Failed to get id for changenumber=5007,cn=changelog from entryrdn index (-30993)
[29/Sep/2014:11:14:01 -0400] - Operation error fetching changenumber=5007,cn=changelog (null), error -30993.
[29/Sep/2014:11:14:01 -0400] ldbm_back_add - conn=0 op=0 modify_term: old_entry=0x0, new_entry=0x0
[29/Sep/2014:11:14:01 -0400] DSRetroclPlugin - replog: an error occured while adding change number 5007, dn = changenumber=5007,cn=changelog: Operations error.
[29/Sep/2014:11:14:01 -0400] retrocl-plugin - retrocl_postob: operation failure [1]

I rebuilt a version without DB_TXN_NOWAIT and tried to reproduce. It happended less frequently but after several attempts I hit a deadlock. Threads possibly involved are :

Thread 17 (Thread 0x7fadb0ff9700 (LWP 1670)):
#0  0x00007faddcab556d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007faddcab11a1 in _L_lock_1022 () from /lib64/libpthread.so.0
#2  0x00007faddcab1142 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007faddd109339 in PR_Lock () from /lib64/libnspr4.so
#4  0x00007fadd1c7f9d0 in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
#5  0x00007fadded1fc50 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007fadded1fea8 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007fadd33fe50e in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#8  0x00007faddecd3e80 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0
#9  0x00007faddecd4143 in do_delete () from /usr/lib64/dirsrv/libslapd.so.0
#10 0x00007faddf1ec16e in connection_threadmain ()
#11 0x00007faddd10ee3b in _pt_root () from /lib64/libnspr4.so
#12 0x00007faddcaaef35 in start_thread () from /lib64/libpthread.so.0
#13 0x00007faddc7ddc3d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fadacff1700 (LWP 1678)):
#0  0x00007faddcab2cf0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fadd69b31c3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so
#2  0x00007fadd69b25a8 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so
#3  0x00007fadd6a5cfda in __lock_get_internal () from /lib64/libdb-5.3.so
#4  0x00007fadd6a5dac0 in __lock_get () from /lib64/libdb-5.3.so
#5  0x00007fadd6a896da in __db_lget () from /lib64/libdb-5.3.so
#6  0x00007fadd69d04a7 in __bam_search () from /lib64/libdb-5.3.so
#7  0x00007fadd69bb126 in __bamc_search () from /lib64/libdb-5.3.so
#8  0x00007fadd69bcbdf in __bamc_get () from /lib64/libdb-5.3.so
#9  0x00007fadd6a76156 in __dbc_iget () from /lib64/libdb-5.3.so
#10 0x00007fadd6a850b4 in __dbc_get_pp () from /lib64/libdb-5.3.so
#11 0x00007fadd33dd1f0 in idl_new_fetch () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#12 0x00007fadd33eb656 in index_read_ext_allids () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#13 0x00007fadd33d5e44 in keys2idl () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#14 0x00007fadd33d65a3 in ava_candidates.isra () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#15 0x00007fadd33d6b92 in filter_candidates_ext () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#16 0x00007fadd33d7c06 in list_candidates () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#17 0x00007fadd33d6b00 in filter_candidates_ext () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#18 0x00007fadd33d7c06 in list_candidates () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#19 0x00007fadd33d6b00 in filter_candidates_ext () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#20 0x00007fadd33d81fa in filter_candidates () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#21 0x00007fadd34136be in ldbm_back_search () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#22 0x00007fadded139d9 in op_shared_search () from /usr/lib64/dirsrv/libslapd.so.0
#23 0x00007fadded23c2e in search_internal_callback_pb () from /usr/lib64/dirsrv/libslapd.so.0
#24 0x00007fadd1652c01 in backend_shr_update_references_cb () from /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#25 0x00007fadd166017f in map_data_foreach_map () from /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#26 0x00007fadd1650a2b in backend_shr_update_references () from /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#27 0x00007fadd1651b86 in backend_shr_add_cb.part () from /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#28 0x00007fadd1651ce1 in backend_shr_betxn_post_add_cb () from /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
#29 0x00007fadded1fc50 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#30 0x00007fadded1fea8 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#31 0x00007fadd33f1fe4 in ldbm_back_add () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#32 0x00007faddecc6c1a in op_shared_add () from /usr/lib64/dirsrv/libslapd.so.0
#33 0x00007faddecc7473 in add_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0
#34 0x00007fadd1c7ff0d in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
#35 0x00007fadded1fc50 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#36 0x00007fadded1fea8 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#37 0x00007fadd340bbf1 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#38 0x00007fadded0cf71 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#39 0x00007fadded0e2af in do_modify () from /usr/lib64/dirsrv/libslapd.so.0
#40 0x00007faddf1ec151 in connection_threadmain ()
#41 0x00007faddd10ee3b in _pt_root () from /lib64/libnspr4.so
#42 0x00007faddcaaef35 in start_thread () from /lib64/libpthread.so.0
#43 0x00007faddc7ddc3d in clone () from /lib64/libc.so.6

The deadlock is really involving the thread 17 and 9 that are both under transaction.
Thread 17 holding page 3 of userRoot/objectclass is waiting for a lock in retrocl_postop.

While thread 9 is waiting for the page 3. It is possible that Thread 9 is holding the retrocl_postop lock (TBC)

Thread 17 (Thread 0x7fadb0ff9700 (LWP 1670)):
80000d0b dd= 1 locks held 64   write locks 35   pid/thread 1641/140383975610112 flags 0    priority 100
...
80000d0b READ          2 HELD    userRoot/cn.db            page          6
80000d0b WRITE         1 HELD    userRoot/id2entry.db      page         75
80000d0b READ         13 HELD    userRoot/objectclass.db   page          3


Thread 9 (Thread 0x7fadacff1700 (LWP 1678)):
80000d13 dd= 0 locks held 17   write locks 13   pid/thread 1641/140383908468480 flags 0    priority 100
80000d13 READ          1 WAIT    userRoot/objectclass.db   page          3
80000d13 READ          1 HELD    userRoot/member.db        page         20
80000d13 WRITE         1 HELD    changelog/numsubordinates.db page          1
...

Original operations are on separated backends:

[29/Sep/2014:12:07:36 -0400] conn=110 op=16 DEL dn="fqdn=ipatestcert.example.com,cn=computers,cn=accounts,dc=example,dc=com"
[29/Sep/2014:12:07:36 -0400] conn=7 op=225 MOD dn="ou=ca,ou=requests,o=ipaca"


MOD (thread 9)
        c_connid = 7
        c_dn = "uid=pkidbuser,ou=people,o=ipaca"
        o_msgid = 302
        o_opid = 225

DEL (thread 17)
        udn = "uid=admin,cn=users,cn=accounts,dc=example,dc=com"
        o_opid = 16
        o_connid = 110
}

The MOD is register (ADD) into the retroCL. 
During that add, schema compt plugin is doing an internal search 
(base="cn=groups,cn=accounts,dc=example,dc=com", filter="(member=changenumber=5513,cn=changelog)")

Marking as clone todo as it may result in Bugzillas eventually.

Here is a summary of the investigation so far and the next steps.

The issue is a failed delete operation (err=1), which leads to an exception in the client. There are three contributors to the situation
1] test client
2] 389 core server
3] 389 plugins

1] the client issues a concurrent MOD on=ipaca and a DEL on the main suffix, this should not be an issue, but by a combination of behaviour in 2] and 3] one of these operations can fail, but the failure is not permanent. repeating the delete of the entry after the test case finished succeeds.
So if 2] and 3] cannot be fully resolved one should reconsider the behaviour of the client.

2] each suffix has its own backend, a BDB database. Modifications on one backend are protected against each other by database transactions and a backend lock at directory server level. if a thread cannot get the lock on a database page there are two modes to behave, either it waits on a lock or it returns db_deadlock and the thread retries the access to the lock for a configured number of retry times. Current DS id configured (at compile time) to use the retry mode and if the retry count is exceeded aborts the operation and returns operation error (err=1).

3] the behaviour described in 2] intends to isolate access to one backend so that deadlocks or retry count exceeds should not occur. But the plugins can bypass this assumption. The can and do try to access other pages in other backends than the one the operation which calls them operates on.

In the current case the first blocking issue was the scheam-compat plugin, which did searches for other backends (even for mods in the retro changelog). This could be resolved by configuring the "ignore-subtree" propoerly for all schema-compat plugin instances.
But now the locking problem has moved to the DNA plugin, where it tries for a MOD in o=ipaca to access pages in the other backend (dc=example). There is a dnaScope configured to restrict teh plugin to dc=example, but looks like it is not effectively working.
And if the dna plugin can be changed tehre is no guarantee that not another one will cause the same issues.

What to do:

  • continue to resolve the issue in the dna plugin
  • think about a general solution: in my opinion this can only be to extend the scope of the backend lock to lock access to all backends, not only to a single backend. But the effect would be a further reduction of parallel modify operations, the impact needs to be evaluated
  • rethink the behaviour on the client side, accept that operations can fail and have to be repeated

the behaviour of the dna plugin is even a bug, it does the internal searches when it shouldn't and vice versa. cf DS ticket #47918.

Thanks for detailed investigation. Is this all a regression caused by some DS changed? As I do not recall I would see this failing earlier, in 1.3.2.*.

I am not fan of client side changes (read - workarounds), I think DS operations should finish successfully if IPA client is not doing any restricted operation. So I would rather fix in the affected plugins, if they are indeed not doing the operations correctly.

no, it is not really a regression. It is a core design issue in DS with multiple backends, plugins are not doing anything illegal.

There is indeed a bug in the dna plugin (since the latest major change #47779) wherer it executes too many and the wrong searches (#47918, now fixed) (this could qualify as a regression).
The other contributor is that the retro changelog is only enabled since IPA is using SyncRepl. There have been many internal schema-compat searches for entries addded to the retro changelog.
With Alexanders agreement they can be suppressed by configuring schema-compat-ignore-subtree.

With these two changes: fix of dna plugin and config change for schema compat plugin I was no longer able to reproduce the failed delete.

But to be honest there is no guarantee that there will not be new scenarios. Fully fixing this would mean:

  • either in the current implementation completely isolate modifications across backends (serialization)
  • or, when moving to new database backend (like MDB) and redesign the backend code handle this better

For now, I would suggest to have a) the dna fix in DS and b) make a patch for the configuration change for the schema compat plugin (even if it would not b erequired for this issue many, many internal searches can be avoided)

The problem is probably now more visib

Replying to [comment:14 lkrispen]:
...

With these two changes: fix of dna plugin and config change for schema compat plugin I was no longer able to reproduce the failed delete.

Ok, if we can at least minimize number of occurrences + minimize number of internal searches (-> increase performance), it is a good start.

...

For now, I would suggest to have a) the dna fix in DS and b) make a patch for the configuration change for the schema compat plugin (even if it would not b erequired for this issue many, many internal searches can be avoided)

Makes sense - please proceed.

master:

  • 08c3fe1 Ignore irrelevant subtrees in schema compat plugin

ipa-4-1:

  • 57eab1e Ignore irrelevant subtrees in schema compat plugin

ipa-4-0:

  • 86b5dce Ignore irrelevant subtrees in schema compat plugin

Together with DS fix for https://fedorahosted.org/389/ticket/47918
the issue no longer occurs.

Metadata Update from @mkosek:
- Issue assigned to lkrispen
- Issue set to the milestone: FreeIPA 4.0.4

6 years ago

Login to comment on this ticket.

Metadata