Replication latency, especially over a WAN, can get worse when several masters receive updates at the same time. One master can take exclusive access to a replica and not release it for a very long time. This blocks the other masters from sending their updates to that consumer, which adds to the replication latency because those updates then have to travel back and forth among all the other masters and consumers. See the Bugzilla for more detailed information.
We need a way to notify a master that it has been holding exclusive access to a replica for too long, and that it needs to yield so other masters can start sending some of their updates to that replica.
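As a rough illustration of the "yield" idea described above, a supplier could compare how long it has held the replica against a configurable limit. This is only a minimal sketch: the struct, its fields, and the function name are hypothetical and are not taken from the 389-ds-base source or from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-session state; names are illustrative only. */
typedef struct {
    time_t acquired_at;   /* when this master acquired exclusive access */
    long release_timeout; /* max seconds to hold the replica; <= 0 disables the limit */
} repl_session_sketch;

/*
 * Return true when the master has held the replica for at least
 * release_timeout seconds and should yield to the other masters.
 */
bool
session_should_yield(const repl_session_sketch *s, time_t now)
{
    assert(s != NULL);
    if (s->release_timeout <= 0) {
        return false; /* limit disabled: never force a yield */
    }
    return difftime(now, s->acquired_at) >= (double)s->release_timeout;
}
```

In a sketch like this, the sending loop would call the check between updates and, when it returns true, end the session early so that another master can acquire the replica.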
attachment 0001-Ticket-48636-Improve-replication-convergence.patch
The patch 0001-Ticket-48636-Improve-replication-convergence.patch looks good to me.
design doc:
http://www.port389.org/docs/389ds/design/repl-conv-design.html
Maybe this is me just not reading it properly, but where / when do we reset the backoff counter?
Otherwise, looks good!
Replying to [comment:10 firstyear]:
Maybe this is me just not reading it properly, but where / when do we reset the backoff counter? Otherwise, looks good!
The backoff code was already there; I just made sure the "busy backoff" time is used when send_updates() returns "yield":
{{{ use_busy_backoff_timer = PR_TRUE; }}}
Each replication agreement has its own thread, repl5_inc_run(). When this thread goes into the backoff state (STATE_BACKOFF_START), it calculates the backoff time depending on certain conditions.
Internal stress testing (rel15) showed a significant improvement in the replication convergence distribution. In the internal testing the highest lag time was ~5 minutes, while without the fix the lag times could reach several hours (8+ hours in some cases).
Committing to master for now. If IPA tests go well then it will be back-ported to 1.2.11.
56ad9a1..a1545cd master -> master commit a1545cd Author: Mark Reynolds mreynolds@redhat.com Date: Wed Jun 8 13:06:46 2016 -0400
Thanks to William for his review too (I forgot to mention him in the commit, sorry).
Fix config validation check 0001-Ticket-48636-Fix-config-validation-check.patch
Fixed config validation check
a2c2bc1..43d5ac6 master -> master commit 43d5ac6
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1349571
8133eaf..b8239e0 389-ds-base-1.3.4 -> 389-ds-base-1.3.4 commit b8239e0 commit a085b0c
62052f7..f223b50 389-ds-base-1.3.3 -> 389-ds-base-1.3.3 commit f223b50 f223b50..c66ba8c 389-ds-base-1.3.3 -> 389-ds-base-1.3.3 commit c66ba8c
68166d4..0c3f203 389-ds-base-1.2.11 -> 389-ds-base-1.2.11 commit 0c3f203 commit 2fc070e
Fix cherry-pick errors in schema file:
0c3f203..e16b83b 389-ds-base-1.2.11 -> 389-ds-base-1.2.11 commit e16b83b
Metadata Update from @mreynolds: - Issue assigned to mreynolds - Issue set to the milestone: 1.2.11.33
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1786
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)