#49413 Changelog trimming ignores disabled replica-agreement and can erase updates not yet replicated
Opened 2 years ago by tbordaz. Modified 2 years ago

Issue Description

The issue is related to changelog trimming.
The updates are trimmed at the condition they are older than maxage AND have been replicated to all consumers.
The set of consumers taken into consideration is those that have an enabled replica agreement.

So if for any reason a replica agreement is disabled for a short period of time, and trimming thread runs at that time, then latest updates known by the consumer can be trimmed and replication breaks

Package Version and Platform

any version

Steps to reproduce

attached test case

Actual results

Replication breaks

Expected results

Replication should not break

Metadata Update from @tbordaz:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Custom field type adjusted to None
- Custom field version adjusted to None

2 years ago

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4 backlog

2 years ago

Metadata Update from @lkrispen:
- Issue assigned to lkrispen

2 years ago

The patch is good.
Do you think it worth improving it with additional conditions ?

  • tests of modifytimestamp/createtimestamp ? For example if the agreement is disabled from longer than the changelog purge delay, then it is safe to ignore that replica agreement to compute the ruv to purge.
  • tests of nsds5replicaLastUpdateStart ? For example if the RA have been in constant backoff for longer that changelog purge delay (likely the consumer is unreachable), then it is safe to ignore that RA

I see your point. you want to trim as much as possibe.

I'm not sure it is worth it, it will make calculation of the purge ruv more complicated and have no guarantee it will not result in missing csns when the agmt is reenabled later.

with the proposed patch, what can happen is, is that the changelog grows until the agmt is really removed, but it can be explained and corrected.

But while writing, omething we could consider is to always take the max(consumerRuv, currentRUV - purgedelay) as purge limit. we do purge entrystate and tombstones rigidly, so maybe it should be done for the changelog as well.

I agree. Making it simple will simplify diagnostic.
We can transfer the detection of dangling RA to some external tool (it could part of the monitoring aspect)

while writing I think of an other corner case that currently prevent to purge that is when a out of sync replica remains in the topology. Nobody can update it, but the RAs to the replica will prevent purging. Again it would rather be the job of monitoring aspect.

I am not sure about your algo. We want to purge old updates (<now-purgedelay) and updates that are already known by consumers. Do you want to merge those two conditions in only one ?
If consumers/supplier are in sync, the purge limit would purge the most recent updates ?

I was probably unclear, I wa sthinking of the "other" purge delay. We have a purge delay defined in the replica used for tombstone purging and entry state purging. It defines a time span how long replication should be able resolve updates.
But then we have in the changelog a separate setting, either maxage or maxentries, which are unrelated to this. My thought was if we should not respect the pirge delay set in the replica

ah !! yes I agree

It leads to incoherency and we need to have both changelog trimming and entry state purging having the same value. I wrote a testcase for it but not created a ticket.

For now I think I agree with @lkrispen that the patch can be simple and we can add more later? because we have added the improvement, then this already improves our server state. :)

Metadata Update from @tbordaz:
- Custom field origin adjusted to IPA (was: None)
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1507194

2 years ago

so do we agree on the simple patch ?

Metadata Update from @lkrispen:
- Custom field reviewstatus adjusted to ack (was: None)

2 years ago

Yes I do. Taking into account all replica agreements will fix the issue.
In addition, if an old replica agreement (that needs to be cleanup) is preventing the trimming, it is fine. It is an administrative task to clear it.

To ssh://pagure.io/389-ds-base.git
339caec..575d9e2 master -> master

To ssh://pagure.io/389-ds-base.git
f20aa78..8a4b26e 389-ds-base-1.3.7 -> 389-ds-base-1.3.7

Login to comment on this ticket.