freeipa-server-4.3 (jenkins build)
The deadlock occurs because two threads take the same two locks (the replica lock and a DB page) in opposite orders.
One thread is processing an incoming request to start a replication session. While holding the replica lock, it checks that the bind DN belongs to the group of authorized replication managers. Because that group is resynced periodically, the check can trigger a lookup of 'cn=replication managers,cn=sysaccounts,cn=etc,<domain>', which accesses a DB page for reading (entryrdn index).
The other thread is processing a write operation that has acquired the DB page for writing. A txn_be plugin then triggers an internal op that needs a CSN from the replica CSN generator, and generating that CSN requires the replica lock.
The deadlock is not systematic, because the resync of the replication managers group waits for a new replication session and has a minimum delay (60s).
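The lock-order inversion described above can be illustrated with a small single-threaded sketch. It is not 389-ds code: the two mutexes and the name demo_lock_inversion are hypothetical, and the DB page is modelled as a plain mutex for simplicity. pthread_mutex_trylock is used so the demo reports the blocked state instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for the replica lock and the DB page lock. */
static pthread_mutex_t demo_replica_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t demo_dbpage_lock = PTHREAD_MUTEX_INITIALIZER;

/* Simulates the state both threads have reached just before the deadlock.
 * Returns 1 if neither side could obtain the lock it still needs. */
static int demo_lock_inversion(void)
{
    /* Session thread: already holds the replica lock. */
    pthread_mutex_lock(&demo_replica_lock);
    /* Writer thread: already holds the DB page. */
    pthread_mutex_lock(&demo_dbpage_lock);

    /* Session thread now needs the DB page for the group lookup. */
    int session_blocked = (pthread_mutex_trylock(&demo_dbpage_lock) == EBUSY);
    /* Writer thread now needs the replica lock to generate a CSN. */
    int writer_blocked = (pthread_mutex_trylock(&demo_replica_lock) == EBUSY);

    pthread_mutex_unlock(&demo_dbpage_lock);
    pthread_mutex_unlock(&demo_replica_lock);
    return session_blocked && writer_blocked;
}
```

With real blocking lock calls in two threads, each side would wait forever for the lock the other one holds.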
In replica_updatedn_list_group_replace() we could separate the two steps: build the new member list without the lock, then do the replace while holding the lock.
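A minimal sketch of that idea, assuming a simplified replica object: demo_replica, demo_build_member_list and demo_refresh are hypothetical names, and the member list is reduced to a single string. It only shows the shape of the change, not the actual patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the replica object; not the real 389-ds struct. */
typedef struct {
    pthread_mutex_t lock;
    char *groupdn_list; /* simplified: one string instead of a DN list */
} demo_replica;

/* Stand-in for the group lookup that may touch DB pages.
 * It is meant to be called WITHOUT the replica lock held. */
static char *demo_build_member_list(const char *group_dn)
{
    const char *prefix = "members-of:";
    char *list = malloc(strlen(prefix) + strlen(group_dn) + 1);
    if (list == NULL) {
        return NULL;
    }
    strcpy(list, prefix);
    strcat(list, group_dn);
    return list;
}

/* Build outside the lock, swap the pointer inside it, free the old list outside. */
static int demo_refresh(demo_replica *r, const char *group_dn)
{
    char *new_list = demo_build_member_list(group_dn); /* no lock held here */
    if (new_list == NULL) {
        return -1;
    }

    pthread_mutex_lock(&r->lock);
    char *old_list = r->groupdn_list;
    r->groupdn_list = new_list; /* critical section is only a pointer swap */
    pthread_mutex_unlock(&r->lock);

    free(old_list);
    return 0;
}
```

Because the lookup no longer runs under the replica lock, the thread never waits for a DB page while holding that lock.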
Hi Thierry, I have a question. It might be a stupid one, though... :)
This means that at line 1139 other threads could have a chance to set a new r->groupdn_list, and lines 1142-1144 could then replace it with this groupdn_list? Is it guaranteed that this does not occur, or is it OK even if it occurs? If the answer is yes, I will ack.
1139 replica_updatedn_list_group_replace(groupdn_list, updatedn_groupds_copy);
1142 replica_updatedn_list_delete(r->groupdn_list, NULL);
1144 r->groupdn_list = groupdn_list;
Good catch !!
The key element is r->updatedn_groups (nsDS5ReplicaBindDnGroup) that is updated while holding the lock.
r->groupdn_list contains the list of members of those groups.
If 'nsDS5ReplicaBindDnGroup' is updated during the refresh, there is a risk that the updated groupdn_list is overwritten by the refresh.
It should be a transient issue as the next refresh will evaluate the correct updatedn_groups.
But if nsDS5ReplicaBindDnGroupCheckInterval is very long (or infinite), replication can break.
I think a possible fix is to check, after reacquiring the replica lock, that updatedn_groupds_copy and r->updatedn_groups are identical.
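The check could look roughly like the sketch below. demo_groups_identical is a hypothetical helper, the group lists are modelled as NULL-terminated string arrays, and DNs are compared with a plain strcmp (a real implementation would presumably compare normalized DNs).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare two NULL-terminated arrays of group DNs, element by element.
 * Meant to be called with the replica lock re-acquired:
 *   snapshot = copy of the groups taken before the lock was released
 *   current  = the groups as they are now in the replica
 * Returns 1 if they match (safe to install the new member list),
 * 0 if the configuration changed meanwhile (discard the result). */
static int demo_groups_identical(const char *const *snapshot,
                                 const char *const *current)
{
    size_t i = 0;

    if (snapshot == NULL || current == NULL) {
        return snapshot == current;
    }
    for (; snapshot[i] != NULL && current[i] != NULL; i++) {
        if (strcmp(snapshot[i], current[i]) != 0) {
            return 0;
        }
    }
    /* Identical only if both arrays end at the same index. */
    return snapshot[i] == NULL && current[i] == NULL;
}
```

If the comparison fails, the freshly built list is dropped and the next refresh evaluates the updated groups.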
Thank you, Thierry! Your take 2 patch looks nice!
Thanks Noriko for the review
git push origin master
Counting objects: 7, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 2.30 KiB | 0 bytes/s, done.
Total 7 (delta 5), reused 0 (delta 0)
6858dfd..5cfd3de master -> master
Metadata Update from @tbordaz:
- Issue assigned to tbordaz
- Issue set to the milestone: 220.127.116.11