389-ds-base-1.3.4.6-1.fc23 freeipa-server-4.3 (jenkins build)
The deadlock occurs because two locks (replica and DB page) are taken in opposite order. One thread is processing the start of an incoming replication session. While holding the replica lock, it checks that the bind DN is a member of the group of authorized replication managers. Because that group is periodically resynced, the check triggers a lookup of 'cn=replication managers,cn=sysaccounts,cn=etc,<domain>', which accesses a DB page for read (entryrdn index).
The other thread is processing a write operation that has acquired the DB page (write); a txn_be plugin then triggers an internal op that needs a 'csn' (from the replica csn generator). Generating the CSN requires the replica lock.
The deadlock is not systematic, because the resync of the replication managers group waits for a new replication session and has a minimum delay (60s).
In replica_updatedn_list_group_replace() we could build the new member list without the lock and then do the replace while holding the lock.
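To make the idea concrete, here is a minimal sketch of "build without the lock, replace with the lock". It is not the attached patch: `demo_replica`, `demo_build_members` and `demo_refresh` are hypothetical names, a pthread mutex stands in for the replica lock, and a string stands in for the member list.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the replica object (not the real 389-ds struct). */
typedef struct {
    pthread_mutex_t lock; /* plays the role of the replica lock */
    char *members;        /* plays the role of r->groupdn_list */
} demo_replica;

/* Hypothetical slow lookup. In the server this would be the internal search
 * that reads DB pages, so it must not run while the replica lock is held. */
static char *demo_build_members(const char *group_dn)
{
    const char *prefix = "members-of:";
    char *out = malloc(strlen(prefix) + strlen(group_dn) + 1);
    if (out == NULL) {
        return NULL;
    }
    strcpy(out, prefix);
    strcat(out, group_dn);
    return out;
}

/* Build the new list with no lock held, then swap it in under the lock.
 * Returns 0 on success, -1 on allocation failure. */
static int demo_refresh(demo_replica *r, const char *group_dn)
{
    assert(r != NULL && group_dn != NULL);

    char *fresh = demo_build_members(group_dn); /* no lock held here */
    if (fresh == NULL) {
        return -1;
    }

    pthread_mutex_lock(&r->lock);
    char *old = r->members;
    r->members = fresh; /* critical section is only a pointer swap */
    pthread_mutex_unlock(&r->lock);

    free(old); /* free(NULL) is a no-op */
    return 0;
}
```

With this shape the replica lock is never held while DB pages are accessed, so the lookup can no longer take part in the lock-order inversion.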
attachment 0001-Ticket-48597-Deadlock-when-rebuilding-the-group-of-a.patch
Hi Thierry, I have a question. It might be a stupid one, though... :)
Does this mean that at line 1139 other threads could have a chance to set a new r->groupdn_list, and that lines 1142-1144 could then replace it with this groupdn_list? Is it guaranteed that this does not occur, or is it OK even if it does? If the answer is yes, I will ack.
{{{
1138 replica_unlock(r->repl_lock);
1139 replica_updatedn_list_group_replace(groupdn_list, updatedn_groupds_copy);
1140 replica_lock(r->repl_lock);
1141
1142 replica_updatedn_list_delete(r->groupdn_list, NULL);
1143 replica_updatedn_list_free(r->groupdn_list);
1144 r->groupdn_list = groupdn_list;
1145 slapi_valueset_free(updatedn_groupds_copy);
}}}
Hi Noriko,
Good catch !!
The key element is r->updatedn_groups (nsDS5ReplicaBindDnGroup) that is updated while holding the lock. r->groupdn_list contains the list of members of those groups.
If during the refresh, 'nsDS5ReplicaBindDnGroup' is updated then there is a risk the updated groupdn_list is overwritten by the refresh. It should be a transient issue as the next refresh will evaluate the correct updatedn_groups. But if nsDS5ReplicaBindDnGroupCheckInterval is very long (or infinite), replication can break.
I think a possible fix is to check, after reacquiring the replica lock, that updatedn_groupds_copy and r->updatedn_groups are identical.
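A minimal sketch of that check, assuming the refresh simply discards its result when the configuration changed during the unlocked window. This is an illustration, not the take 2 patch: `demo_replica2`, `demo_dup`, `demo_change_groups`, `demo_refresh_checked` and the `during_build` hook (used only to simulate a concurrent update) are all hypothetical, and strings stand in for the value sets.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the replica object (not the real 389-ds struct). */
typedef struct {
    pthread_mutex_t lock; /* plays the role of the replica lock */
    char *groups;         /* plays the role of r->updatedn_groups */
    char *members;        /* plays the role of r->groupdn_list */
} demo_replica2;

static char *demo_dup(const char *s)
{
    char *out = malloc(strlen(s) + 1);
    if (out != NULL) {
        strcpy(out, s);
    }
    return out;
}

/* Simulates another thread updating nsDS5ReplicaBindDnGroup under the lock. */
static void demo_change_groups(demo_replica2 *r)
{
    char *next = demo_dup("cn=other-group");
    assert(next != NULL);
    pthread_mutex_lock(&r->lock);
    free(r->groups);
    r->groups = next;
    pthread_mutex_unlock(&r->lock);
}

/* Returns 1 if the rebuilt list was installed, 0 if it was discarded because
 * the groups changed meanwhile, -1 on allocation failure. */
static int demo_refresh_checked(demo_replica2 *r,
                                void (*during_build)(demo_replica2 *))
{
    pthread_mutex_lock(&r->lock);
    char *snapshot = demo_dup(r->groups); /* copy taken under the lock */
    pthread_mutex_unlock(&r->lock);
    if (snapshot == NULL) {
        return -1;
    }

    char *fresh = demo_dup(snapshot); /* stand-in for expanding the groups */
    if (during_build != NULL) {
        during_build(r); /* unlocked window: config may change here */
    }
    if (fresh == NULL) {
        free(snapshot);
        return -1;
    }

    int installed = 0;
    pthread_mutex_lock(&r->lock);
    if (strcmp(snapshot, r->groups) == 0) { /* still the same configuration? */
        free(r->members);
        r->members = fresh;
        fresh = NULL;
        installed = 1;
    }
    pthread_mutex_unlock(&r->lock);

    free(fresh); /* stale result, if any, is dropped */
    free(snapshot);
    return installed;
}
```

When the comparison fails, the stale list is dropped, so a member list derived from an outdated group configuration never overwrites the current one.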
attachment 0002-Ticket-48597-Deadlock-when-rebuilding-the-group-of-a.patch
Thank you, Thierry! Your take 2 patch looks nice!
Thanks Noriko for the review
git push origin '''master'''
{{{
Counting objects: 7, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 2.30 KiB | 0 bytes/s, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   6858dfd..5cfd3de  master -> master
}}}
Metadata Update from @tbordaz: - Issue assigned to tbordaz - Issue set to the milestone: 1.3.5.0
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1784
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)