#48636 Improve MMR replication convergence
Closed: wontfix. Opened 8 years ago by mreynolds.

Replication latency, especially over a WAN, can get worse when several masters are receiving updates at the same time. One master takes exclusive access of a replica and does not release it for a very long time. This blocks the other masters from sending their updates to that consumer, which adds to the replication latency because those updates then have to travel back and forth through all the other masters and consumers. See the bugzilla for more detailed info.

We need a way to notify a master that it is holding its exclusive access of a replica for too long, and that it needs to yield so other masters can start sending some of their updates to that replica.
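To make the idea concrete, here is a minimal sketch of the kind of check a master could run between updates to decide whether it should yield. It is not taken from the patch: `repl_session`, `max_hold_secs`, `others_waiting` and `session_should_yield()` are hypothetical names, and the rule itself (yield only when another master is waiting and a time limit has passed) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical bookkeeping for one exclusive session on a replica.
 * None of these names come from 389-ds-base. */
typedef struct {
    time_t acquired_at;    /* when this master acquired the replica */
    time_t max_hold_secs;  /* assumed limit for one session; 0 disables the check */
    bool others_waiting;   /* assumed flag: another master tried to acquire and was refused */
} repl_session;

/* Return true when the master holding the replica should release it.
 * In this sketch that requires both a waiting master and an expired limit. */
bool
session_should_yield(const repl_session *s, time_t now)
{
    assert(s != NULL);
    if (s->max_hold_secs <= 0 || !s->others_waiting) {
        return false;
    }
    return (now - s->acquired_at) >= s->max_hold_secs;
}
```

A master would call something like `session_should_yield(&session, time(NULL))` after each batch of updates, then release the replica and go into backoff when it returns true.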


The patch 0001-Ticket-48636-Improve-replication-convergence.patch looks good to me.

Maybe this is me just not reading it properly, but where / when do we reset the backoff counter?

Otherwise, looks good!

Replying to [comment:10 firstyear]:

> Maybe this is me just not reading it properly, but where / when do we reset the backoff counter?
>
> Otherwise, looks good!

The backoff code was already there; I just made sure the "busy backoff" time is used when send_updates() returns "yield":

{{{
use_busy_backoff_timer = PR_TRUE;
}}}

Each replication agreement has its own thread, repl5_inc_run(). When this thread goes into a backoff state (STATE_BACKOFF_START), it calculates the backoff time depending on certain conditions.
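As a rough illustration of how a backoff time might be chosen, the sketch below uses a fixed "busy" wait when `use_busy_backoff_timer` is set and a doubling wait capped at a maximum otherwise. This is an assumption about the general shape of such logic, not the code in repl5_inc_run(): `backoff_state`, its fields and `next_backoff_secs()` are hypothetical, and the sketch does not model when the counter is reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-agreement backoff state; names and fields are illustrative only. */
typedef struct {
    long busy_secs;     /* fixed wait used after "busy" or "yield" */
    long min_secs;      /* first wait of the doubling sequence */
    long max_secs;      /* cap for the doubling sequence */
    long current_secs;  /* last doubling wait handed out; 0 means none yet */
} backoff_state;

/* Return the number of seconds to wait before the next attempt.
 * With the busy timer the wait is fixed and the doubling sequence is left untouched;
 * otherwise the wait doubles on every call until it reaches max_secs. */
long
next_backoff_secs(backoff_state *b, bool use_busy_backoff_timer)
{
    assert(b != NULL);
    assert(b->min_secs > 0 && b->min_secs <= b->max_secs);

    if (use_busy_backoff_timer) {
        return b->busy_secs;
    }
    if (b->current_secs == 0) {
        b->current_secs = b->min_secs;
    } else if (b->current_secs > b->max_secs / 2) {
        b->current_secs = b->max_secs;  /* doubling would pass the cap */
    } else {
        b->current_secs *= 2;
    }
    return b->current_secs;
}
```

The point of the fixed wait in this sketch is that a master that has just yielded retries after a short, predictable pause instead of an ever-growing one.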

Internal stress testing (rel15) showed a significant improvement in the replication convergence distribution. With the fix, the highest lag time was ~5 minutes; without it, lag times could reach several hours (8+ hours in some cases).

Committing to master for now. If IPA tests go well then it will be back-ported to 1.2.11.

56ad9a1..a1545cd master -> master
commit a1545cd
Author: Mark Reynolds <mreynolds@redhat.com>
Date: Wed Jun 8 13:06:46 2016 -0400

Thanks to William for his review too (forgot to mention him in the commit, sorry).

Fixed config validation check

8133eaf..b8239e0 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit b8239e0
commit a085b0c

62052f7..f223b50 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit f223b50
f223b50..c66ba8c 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit c66ba8c

68166d4..0c3f203 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 0c3f203
commit 2fc070e

Fix cherry-pick errors in schema file:

0c3f203..e16b83b 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit e16b83b

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.2.11.33

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1786

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

4 years ago
